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ABSTRACT 

We discuss the cosmological simulation code GADGET-2, a new massively parallel 
TreeSPH code, capable of following a collisionless fluid with the N-body method, and 
an ideal gas by means of smoothed particle hydrodynamics (SPH). Our implementa- 
tion of SPH manifestly conserves energy and entropy in regions free of dissipation, 
while allowing for fully adaptive smoothing lengths. Gravitational forces are com- 
puted with a hierarchical multipole expansion, which can optionally be applied in the 
form of a TreePM algorithm, where only short-range forces are computed with the 
'tree'-method while long-range forces are determined with Fourier techniques. Time 
integration is based on a quasi-symplectic scheme where long-range and short-range 
forces can be integrated with different timesteps. Individual and adaptive short-range 
timesteps may also be employed. The domain decomposition used in the parallelisa- 
tion algorithm is based on a space-filling curve, resulting in high flexibility and tree 
force errors that do not depend on the way the domains are cut. The code is efficient 
in terms of memory consumption and required communication bandwidth. It has been 
used to compute the first cosmological N-body simulation with more than 10 10 dark 
matter particles, reaching a homogeneous spatial dynamic range of 10 5 per dimension 
in a 3D box. It has also been used to carry out very large cosmological SPH simu- 
lations that account for radiative cooling and star formation, reaching total particle 
numbers of more than 250 million. We present the algorithms used by the code and 
discuss their accuracy and performance using a number of test problems. GADGET-2 
is publicly released to the research community. 
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1 INTRODUCTION 

Cosmological simulations play an ever more important role 
in theoretical studies of the structure formation process in 
the Universe. Without numerical simulations, the ACDM 
model may arguably not have developed into the leading 
theoretical paradigm for structure formation which it is to- 
day. This is because direct simulation is often the only avail- 
able tool to compute accurate theoretical predictions in the 
highly non-linear regime of gravitational dynamics and hy- 
drodynamics. This is particularly true for the hierarchical 
structure formation process with its inherently complex ge- 
ometry and three-dimensional dynamics. 

The list of important theoretical cosmological results 
based on simulation work is therefore quite long, including 
fundamental results such as the density profiles of dark mat- 
ter halos (e.g. Navarro et al., 1996), the existence and dy- 
namics of dark matter substructure (e.g. Tormen, 1997), the 
non-linear clustering properties of dark matter (e.g. Jenkins 
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et al., 1998), the halo abundance (e.g. Jenkins et al., 2001), 
the temperature and gas profiles of clusters of galaxies (e.g. 
Evrard, 1990) , or the detailed properties of Lyman-a absorp- 
tion lines in the interstellar medium (e.g. Hernquist et al., 
1996). Given that many astrophysical phenomena involve 
a complex interplay of physical processes on a wide range 
of scales, it seems clear that the importance of simulation 
methods will continue to grow. This development is further 
fueled by the rapid progress in computer technology, which 
makes an ever larger dynamic range accessible to simulation 
models. However, powerful computer hardware is only one 
requirement for research with numerical simulations. The 
other, equally important one, lies in the availability of suit- 
able numerical algorithms and simulation codes, capable of 
efficiently exploiting available computers to study physical 
problems of interest, ideally in a highly accurate and flexible 
way, so that new physics can be introduced easily. 

This paper is about a novel version of the simulation 
code GADGET, which was written and publicly released in 
its original form four years ago (Springel et al., 2001), af- 
ter which it found widespread use in the research of many 
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simulation groups. The code discussed here has principal ca- 
pabilities similar to the original GADGET code. It can evolve 
all the systems (plus a number of additional ones) that the 
first version could, but it does this more accurately, and 
substantially faster. It is also more flexible, more memory 
efficient, and more easily extendible, making it considerably 
more versatile. These improvements can be exploited for 
more advanced simulations and demonstrate that progress 
in algorithmic methods can be as important, or sometimes 
even more important, than the performance increase offered 
by new generations of computers. 

The principal structure of GADGET-2 is that of a 
TreeSPH code (Hernquist & Katz, 1989), where gravita- 
tional interactions are computed with a hierarchical multi- 
pole expansion, and gas dynamics is followed with smoothed 
particle hydrodynamics (SPH). Gas and collisionless dark 
matter* are both represented by particles in this scheme. 
Note that while there is a large variety of techniques for 
computing the gravitational field, the basic N-body method 
for representing a collisionless fluid is the same in all cos- 
mological codes, so that they ultimately only differ in the 
errors with which they approximate the gravitational field. 

Particle-mesh (PM) methods (e.g. Klypin & Shandarin, 
1983; White et al., 1983) are the fastest schemes for com- 
puting the gravitational field, but for scales below 1-2 mesh 
cells, the force is heavily suppressed, as a result this tech- 
nique is not well suited for work with high spatial resolution. 
The spatial resolution can be greatly increased by adding 
short-range direct-summation forces (Hockney & Eastwood, 
1981; Efstathiou et al., 1985), or by using additional Fourier- 
meshes adaptively placed on regions of interest (Couch- 
man, 1991). The mesh can also be adaptively refined, with 
the potential found in real space using relaxation methods 
(Kravtsov et al, 1997; Knebe et al., 2001). 

The hierarchical tree algorithms (Appel, 1985; Barnes 
& Hut, 1986; Dehnen, 2000) follow a different approach, and 
have no intrinsic resolution limit. Particularly for mass dis- 
tributions with low density contrast, they can however be 
substantially slower than Fourier-based methods. The recent 
development of TreePM hybrid methods (Xu, 1995) tries to 
combine the best of both worlds by restricting the tree algo- 
rithm to short-range scales, while computing the long-range 
gravitational force by means of a particle-mesh algorithm. 
GADGET-2 offers this method as well. 

Compared to gravity, much larger conceptual differ- 
ences exist between the different hydrodynamical methods 
employed in current cosmological codes. Traditional 'Eule- 
rian' methods discretise space and represent fluid variables 
on a mesh, while 'Lagrangian' methods discretise mass, us- 
ing, for example, a set of fluid particles to model the flow. 
Both methods have found widespread application in cosmol- 
ogy. Mesh-based codes include algorithms with a fixed mesh 
(e.g. Cen & Ostriker, 1992, 1993; Yepes et al., 1995; Pen, 
1998), and more recently also with adaptive meshes (e.g. 
Bryan & Norman, 1997; Norman & Bryan, 1999; Teyssier, 
2002; Kravtsov et al., 2002; Quilis, 2004). Lagrangian codes 
have almost all employed SPH thus far (e.g. Evrard, 1988; 
Hernquist & Katz, 1989; Navarro & White, 1993; Katz et al., 
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1996; Couchman et al., 1995; Steinmetz, 1996; Serna et al., 
1996; Dave et al., 1997; Tissera et al., 1997; Owen et al., 
1998; Serna et al., 2003; Wadsley et al., 2004), although this 
is not the only possibility (Gnedin, 1995; Whitehurst, 1995). 

Mesh-codes offer superior resolving power for hydrody- 
namical shocks, with some methods being able to capture 
shocks without artifical viscosity, and with very low residual 
numerical viscosity. However, static meshes are only poorly 
suited for the high dynamic range encountered in cosmol- 
ogy. Even for meshes as large as 1024 3 , which is a chal- 
lenge at present (e.g. Kang et al., 2005; Cen et al., 2003), 
individual galaxies in a cosmological volume are poorly re- 
solved, leaving no room for resolving internal structure such 
as bulge and disk components. A potential solution is pro- 
vided by new generations of adaptive mesh refinement codes, 
which begin to be more widely used in cosmology (e.g. Abel 
et al., 2002; Kravtsov et al., 2002; Refregier & Teyssier, 2002; 
Motl et al., 2004). Some drawbacks of the mesh remain how- 
ever even here. For example, the dynamics is in general not 
Galilean-invariant, there are advection errors, and there can 
be spurious generation of entropy due to mixing. 

In contrast, Lagrangian methods like SPH are particu- 
larly well-suited to follow the gravitational growth of struc- 
ture, and to automatically increase the resolution in the 
central regions of galactic halos, which are the regions of 
primary interest in cosmology. The accurate treatment of 
self-gravity of the gas in a fashion consistent with that of 
the dark matter is a further strength of the particle-based 
SPH method. Another fundamental difference with mesh 
based schemes is that thermodynamic quantities advected 
with the flow do not mix between different fluid elements at 
all, unless explicitly modelled. An important disadvantage 
of SPH is that the method has to rely on an artificial viscos- 
ity for supplying the necessary entropy injection in shocks. 
The latter are broadened over the SPH smoothing scale and 
not resolved as true discontinuities. 

In this paper, we give a concise description of the numer- 
ical model and the novel algorithmic methods implemented 
in GADGET-2, which may also serve as a reference for the 
publicly released version of this code. In addition we mea- 
sure the code performance and accuracy for different types 
of problems, and discuss the results obtained for a num- 
ber of test problems, focusing in particular on gasdynamical 
simulations. 

This paper is organised as follows. In Section 2, we sum- 
marise the set of equations the code integrates forward in 
time. We then discuss in Section 3 the algorithms used to 
compute the 'right-hand side' of these equations efficiently, 
i.e. the gravitational and hydrodynamical forces. This is fol- 
lowed by a discussion of the time integration scheme in Sec- 
tion 4, and an explanation of the parallelisation strategy in 
Section 5. We present results for a number of test problems 
in Section 6, followed by a discussion of code performance in 
Section 7. Finally, we summarise our findings in Section 8. 



2 BASIC EQUATIONS 

We here briefly summarise the basic set of equations that 
are studied in cosmological simulations of structure forma- 
tion. They describe the dynamics of a collisionless com- 
ponent (dark matter or stars in galaxies) and of an ideal 
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gas (ordinary baryons, mostly hydrogen and helium), both 
subject to and coupled by gravity in an expanding back- 
ground space. For brevity, we focus on the discretised forms 
of the equations, noting the simplifications that apply for 
non-expanding space where appropriate. 



2.1 Collisionless dynamics 

The continuum limit of non-interacting dark matter is de- 
scribed by the collisionless Boltzmann equation coupled to 
the Poisson equation in an expanding background Universe, 
the latter taken normally as a Friedman-Lemaitre model. 
Due to the high-dimensionality of this problem, these equa- 
tions are best solved with the N-body method, where phase- 
space density is sampled with a finite number N of tracer 
particles. 

The dynamics of these particles is then described by the 
Hamiltonian 



H 



E 



2 mi a(t)'- 



+ 



IE 



miinj ip(xi — Xj) 



(1) 



where H = H(p 1 , . . . ,p N ,xi, . . . ,XN,t). The Xi are comov- 
ing coordinate vectors, and the corresponding canonical mo- 
menta are given by p i — a 2 rriiXi. The explicit time depen- 
dence of the Hamiltonian arises from the evolution a(t) of 
the scale factor, which is given by the Friedman-Lemaitre 
model. 

If we assume periodic boundary conditions for a cube 
of size L 3 , the interaction potential ip(x) is the solution of 



V>(x) = 4ttG 



(2) 



where the sum over n = (711,712,713) extends over all in- 
teger triplets. Note that the mean density is subtracted 
here, so the solution corresponds to the peculiar potential, 
where the dynamics of the system is governed by \7 2 cj>(x) = 
4nG[p(x) — p]. For our discretised particle system, we define 
the peculiar potential as 



<f)(x) = ^ nii <p(x 



Xi). 



(3) 



The single particle density distribution function S(x) is the 
Dirac 5-function convolved with a normalised gravitational 
softening kernel of comoving scale e. For it, we employ the 
spline kernel (Monaghan & Lattanzio, 1985) used in SPH 
and set 5(x) — W(|x|,2.8e), where W(r) is given by 



W(r,k) = — 



1-6(tt) 2 +6(^) 3 , 0<£<±, 




< 1, 



(4) 



77 >1- 



For this choice, the Newtonian potential of a point mass at 
zero lag in non-periodic space is —Gm/e, the same as for a 
Plummer 'sphere' of size e. 

If desired, we can simplify to Newtonian space by set- 
ting a(t) = 1, so that the explicit time dependence of the 
Hamiltonian vanishes. For vacuum boundaries, the interac- 
tion potential simplifies to the usual Newtonian form, i.e. 
for point masses it is given by <p(x) = —G/\x\ modified by 
the softening for small separations. 



Note that independent of the type of boundary condi- 
tions, a complete force computation involves a double sum, 
resulting in a iV 2 -scaling of the computational cost. This re- 
flects the long-range nature of gravity, where each particle 
interacts with every other particle, making high-accuracy 
solutions for the gravitational forces very expensive for large 
N. Fortunately, the force accuracy needed for collisionless 
dynamics is comparatively modest. Force errors up to ~ 1 
per cent tend to only slightly increase the numerical relax- 
ation rate without compromising results (Hernquist et al., 
1993), provided the force errors are random. This allows the 
use of approximative force computations by methods such as 
those discussed in Section 3. We note however that the situ- 
ation is different for collisional N-body systems, such as star 
clusters. Here direct summation can be necessary to deliver 
the required force accuracy, a task that triggered the devel- 
opment of powerful custom-made computers such as GRAPE 
(e.g Makino, 1990; Makino et al., 2003). These systems can 
then also be applied to collisionless dynamics using a direct- 
summation approach (e.g. Steinmetz, 1996; Makino et al., 
1997), or by combining them with tree- or treePM-methods 
(Fukushige et al., 2005). 



2.2 Hydrodynamics 

Smoothed particle hydrodynamics (SPH) uses a set of dis- 
crete tracer particles to describe the state of a fluid, with 
continuous fluid quantities being defined by a kernel inter- 
polation technique (Lucy, 1977; Gingold & Monaghan, 1977; 
Monaghan, 1992). The particles with coordinates r», veloci- 
ties Vi, and masses irn are best thought of as fluid elements 
that sample the gas in a Lagrangian sense. The thermody- 
namic state of each fluid element may either be defined in 
terms of its thermal energy per unit mass, m, or in terms 
of the entropy per unit mass, Si. We prefer to use the lat- 
ter as the independent thermodynamic variable evolved in 
SPH, for reasons discussed in full detail in Springel & Hern- 
quist (2002). Our formulation of SPH manifestly conserves 
both energy and entropy even when fully adaptive smooth- 
ing lengths are used. Traditional formulations of SPH on 
the other hand can violate entropy conservation in certain 
situations. 

We begin by noting that it is more convenient to work 
with an entropic function defined by A = P/p~' , instead of 
directly using the entropy s per unit mass. Because A = A(s) 
is only a function of s for an ideal gas, we will often refer to 
A as 'entropy'. 

Of fundamental importance for any SPH formulation is 
the density estimate, which GADGET-2 does in the form 



Pi = "yimjWQrglhi), 



(5) 



where nj = r% — rj, and W(r,h) is the SPH smoothing 
kernel defined in equation (4) In our 'entropy formulation' 



' We note that most of the literature on SPH defines the smooth- 
ing length such that the kernel drops to zero at a distance 2h, and 
not at h as we have chosen here for consistency with Springel et al. 
(2001). This is only a difference in notation without bearing on 
the results. 
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of SPH, the adaptive smoothing lengths hi of each particle 
are defined such that their kernel volumes contain a constant 
mass for the estimated density, i.e. the smoothing lengths 
and the estimated densities obey the (implicit) equations 



4tt 3 

Pi 



N, 



sphm, 



(6) 



where iV sp h is the typical number of smoothing neighbours, 
and m is an average particle mass. Note that in many other 
formulations of SPH, smoothing lengths are typically chosen 
such that the number of particles inside the smoothing radius 
hi is nearly equal to a constant value iV sp h. 

Starting from a discretised version of the fluid La- 
grangian, one can show (Springel & Hernquist, 2002) that 
the equations of motion for the SPH particles are given by 



= -E' 



3 = 1 



fi^ViWaihi) + f^v.WiAhi) 

. Pi Pi 



where the coefficients /, are defined by 



fi = 



1 + 



hi dpi 
3pi dhi 



, (7) 



(8) 



and the abbreviation Wij(h) = W^dr, — rj\,h) has been 
used. The particle pressures are given by Pi = AipJ . Pro- 
vided there are no shocks and no external sources of heat, the 
equations above already fully define reversible fluid dynam- 
ics in SPH. The entropy At of each particle stays constant 
in such a flow. 

However, flows of ideal gases can easily develop discon- 
tinuities, where entropy is generated by microphysics. Such 
shocks need to be captured by an artificial viscosity in SPH. 
To this end, GADGET-2 uses a viscous force 



dvi 



(9) 



3 = 1 



where Ylij > is non-zero only when particles approach each 
other in physical space. The viscosity generates entropy at 
a rate 



(10) 



3 = 1 



transforming kinetic energy of gas motion irreversibly into 
heat. The symbol Wij is here the arithmetic average of the 
two kernels Wij (hi) and Wij(hj). 

The Monaghan-Balsara form of the artificial viscosity 
(Monaghan & Gingold, 1983; Balsara, 1995) is probably the 
most widely employed parameterisation of the viscosity in 
SPH codes. It takes the form 



n, : 



with 



Hij — 



[-acij/Mj + Pp%]/p tJ if Vij ■ nj < 



otherwise, 



(11) 



(12) 



Here hij and pij denote arithmetic means of the correspond- 
ing quantities for the two particles i and j, with dj giving 
the mean sound speed. The strength of the viscosity is regu- 
lated by the parameters a and (3, with typical values in the 
range a ~ 0.5 — 1.0 and the frequent choice of (3 = 2 a. 



Based on an analogy with the Riemann problem and 
using the notion of a signal velocity between two par- 
ticles, Monaghan (1997) derived a slightly modified pa- 
rameterisation of the viscosity, namely the ansatz Yiij — 
— ^WijV^/pij. In the simplest form, the signal velocity can 
be estimated as 

v*f = Ci +Cj - 3wij, (13) 

where u>ij = Vij ■ Vij/\rij\ is the relative velocity projected 
onto the separation vector, provided the particles approach 
each other, i.e. for Vij ■ rij < 0, otherwise we set Wij = 0. 
This gives a viscosity of the form 

Ilij = — ^' + Cj - 3Wij ^ Wij (14) 

which is identical to (11) if one sets /3 = 3/2 a and replaces 
Wij with pij. The main difference between the two viscosi- 
ties lies therefore in the additional factor of hij/rij that pij 
carries with respect to Wij. In equations (11) and (12), this 
factor weights the viscous force towards particle pairs with 
small separations. In fact, after multiplying with the ker- 
nel derivative, this weighting is strong enough to make the 
viscous force diverge as oc 1/rij for small pair separations, 
unless p^ in equation (12) is softened at small separations 
by adding some fraction of hfj in the denominator, as it is 
often done in practice. 

In the equation of motion, the viscosity acts like an 
excess pressure P v i S c — \pijHij assigned to the particles. 
For the new form (14) of the viscosity, this is given by 



Ptho 



(15) 



assuming roughly equal sound speeds and densities of the 
two particles for the moment. This viscous pressure depends 
only on a Mach- number like quantity w /c, and not explicitly 
on the particle separation or smoothing length. We found 
that the modified viscosity (14) gives equivalent or improved 
results in our tests compared to the standard formulation 
of equation (11). In simulations with dissipation it has the 
advantage that the occurrence of very large viscous acceler- 
ations is reduced, so that a more efficient and stable time 
integration results. For these reason, we usually adopt the 
viscosity (14) in GADGET-2. 

The signal-velocity approach naturally leads to a 
Courant-like hydrodynamical timestep of the form 

Ccourant hi . . 

(16) 



At 



(hyd) 



maxj(ci + Cj — 3u>ij) 

which is adopted by GADGET-2. The maximum is here de- 
termined with respect to all neighbours j of particle i. 

Following Balsara (1995) and Steinmetz (1996), 
GADGET-2 also uses an additional viscosity-limiter to alle- 
viate spurious angular momentum transport in the presence 
of shear flows. This is done by multiplying the viscous tensor 
with (ft + fj)/2, where 

IV x vU 



fi = 



(17) 



|V ■ v\i + |V x v\i 

is a simple measure for the relative amount of shear in the 
flow around particle i, based on standard SPH estimates for 
divergence and curl (Monaghan, 1992). 

The above equations for the hydrodynamics were all 



© 0000 RAS, MNRAS 000, 000-000 



The cosmological simulation code GADGET-2 5 



expressed using physical coordinates and velocities. In the 
actual simulation code, we use comoving coordinates x, co- 
moving momenta p and comoving densities as internal com- 
putational variables, which are related to physical variables 
in the usual way. Since we continue to use the physical en- 
tropy, adiabatic cooling due to expansion of the universe is 
automatically treated accurately. 

2.3 Additional physics 

A number of further physical processes have already been 
implemented in GADGET-2, and were applied to study struc- 
ture formation problems. A full discussion of this physics 
(which is not included in the public release of the code) is 
beyond the scope of this paper. However, we here give a brief 
overview of what has been done so far and refer the reader 
to the cited papers for physical results and technical details. 

Radiative cooling and heating by photoionisation has 
been implemented in GADGET-2 in a similar way as in Katz 
et al. (1996), i.e. the ionisation balance of helium and hydro- 
gen is computed in the presence of an externally specified 
time-dependent UV background under the assumption of 
collisional ionisation equilibrium. Yoshida et al. (2003) re- 
cently added a network for the nonequilibrium treatment of 
the primordial chemistry of nine species, allowing cooling by 
molecular hydrogen to be properly followed. 

Star formation and associated feedback processes have 
been modelled with GADGET by a number of authors using 
different physical approximations. Springel (2000) consid- 
ered a feedback model based on a simple turbulent pressure 
term, while Kay (2004) studied thermal feedback with de- 
layed cooling. A related model was also implemented by Cox 
et al. (2004). Springel & Hernquist (2003a, b) implemented a 
subresolution multiphase model for the treatment of a star- 
forming interstellar medium. Their model also accounts for 
energetic feedback by galactic winds, and includes a basic 
treatment of metal enrichment. More detailed metal enrich- 
ment models that allow for separate treatments of type-II 
and type-I supernovae while also properly accounting for the 
lifetimes of different stellar populations have been indepen- 
dently implemented by Tornatore et al. (2004) and Scanna- 
pieco et al. (2005). A different, more explicit approach to 
describe a multiphase ISM has been followed by Marri & 
White (2003), who introduced a hydrodynamic decoupling 
of cold gas and the ambient hot phase. A number of studies 
also used more ad-hoc models of feedback in the form of pre- 
heating prescriptions (Springel et al., 2001; van den Bosch 
et al., 2003; Tornatore et al., 2003). 

A treatment of thermal conduction in hot ionised gas 
has been implemented by Jubelgas et al. (2004) and was 
used to study modifications of the intracluster medium 
of rich clusters of galaxies (Dolag et al., 2004c) caused 
by conduction. An SPH approximation of ideal magneto- 
hydrodynamics has been added to GADGET-2 and was used 
to study deflections of ultra-high energy cosmic rays the Lo- 
cal Universe Dolag et al. (2004b, 2005). 

Di Matteo et al. (2005) and Springel et al. (2005) intro- 
duced a model for the growth of supermassive black holes 
at the centres of galaxies, and studied how energy feedback 
from gas accretion onto a supermassive black hole regulates 
quasar activity and nuclear star formation. Cuadra et al. 
(2005) added the ability to model stellar winds and studied 



the feeding of Sgr A* by the stars orbiting in the vicinity of 
the centre of the Galaxy. 

Finally, non-standard dark matter dynamics has also 
been investigated with GADGET. Linder & Jenkins (2003) 
and Dolag et al. (2004a) independently studied dark en- 
ergy cosmologies. Also, both Yoshida et al. (2000) and Dave 
et al. (2001) studied halo formation with self-interacting 
dark matter, modelled by explicitly introducing collisional 
terms for the dark matter particles. 



3 GRAVITATIONAL ALGORITHMS 

Gravity is the driving force of structure formation. Its com- 
putation thus forms the core of any cosmological code. Un- 
fortunately, its long-range nature, and the high-dynamic 
range posed by the structure formation problem, make it 
particularly challenging to compute the gravitational forces 
accurately and efficiently. In the GADGET-2 code, both the 
collisionless dark matter and the gaseous fluid are repre- 
sented as particles, allowing the self-gravity of both compo- 
nents to be computed with gravitational N-body methods, 
which we discuss next. 



3.1 The tree algorithm 

The primary method that GADGET-2 uses to achieve the 
required spatial adaptivity is a hierarchical multipole ex- 
pansion, commonly called a tree algorithm. These methods 
group distant particles into ever larger cells, allowing their 
gravity to be accounted for by means of a single multipole 
force. Instead of requiring N — 1 partial forces per particle 
as needed in a direct-summation approach, the gravitational 
force on a single particle can then be computed with just 
G(\ogN) interactions. 

In practice, the hierarchical grouping that forms the ba- 
sis of the multipole expansion is most commonly obtained by 
a recursive subdivision of space. In the approach of Barnes 
& Hut (1986, BH hereafter), a cubical root node is used to 
encompass the full mass distribution, which is repeatedly 
subdivided into eight daughter nodes of half the side-length 
each, until one ends up with 'leaf nodes containing single 
particles. Forces are then obtained by "walking" the tree, 
i.e. starting at the root node, a decision is made whether or 
not the multipole expansion of the node is considered to pro- 
vide an accurate enough partial force (which will in general 
be the case for nodes that are small and distant enough). If 
the answer is 'yes', the multipole force is used and the walk 
along this branch of the tree can be terminated, if it is 'no', 
the node is "opened", i.e. its daughter nodes are considered 
in turn. 

It should be noted that the final result of the tree al- 
gorithm will in general only represent an approximation to 
the true force. However, the error can be controlled conve- 
niently by modifying the opening criterion for tree nodes, 
because higher accuracy is obtained by walking the tree to 
lower levels. Provided sufficient computational resources are 
invested, the tree force can then be made arbitrarily close 
to the well-specified correct force. 
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3.1.1 Details of the tree code 

There are three important characteristics of a gravitational 
tree code: the type of grouping employed, the order chosen 
for the multipole expansion, and the opening criterion used. 
As a grouping algorithm, we prefer the geometrical BH oct- 
tree instead of alternatives such as those based on nearest- 
neighbour pairings (Jernigan & Porter, 1989) or a binary 
kD-tree Stadel (2001). The oct-tree is 'shallower' than the 
binary tree, i.e. fewer internal nodes are needed for a given 
number iV of particles. In fact, for a nearly homogeneous 
mass distribution, only ~ 0.3 N internal nodes are needed, 
while for a heavily clustered mass distribution in a cosmo- 
logical simulation, this number tends to increase to about 
w 0.65 N, which is still considerably smaller than the num- 
ber of « N required in the binary tree. This has advantages 
in terms of memory consumption. Also, the oct-tree avoids 
problems due to large aspect ratios of nodes, which helps to 
keep the magnitude of higher order multipoles small. The 
clean geometric properties of the oct-tree make it ideal for 
use as a range-searching tool, a further application of the 
tree we need for finding SPH neighbours. Finally, the ge- 
ometry of the oct-tree has a close correspondence with a 
space-filling Peano-Hilbert curve, a fact we exploit for our 
parallelisation algorithm. 

With respect to the multipole order, we follow a dif- 
ferent approach than used in GADGET- 1, where an expan- 
sion including octopole moments was employed. Studies by 
Hernquist (1987) and Barnes & Hut (1989) indicate that 
the use of quadrupole moments may increase the efficiency 
of the tree algorithm in some situations, and Wadsley et al. 
(2004) even advocate hexadecopole order as an optimum 
choice. Higher order typically allows larger cell-opening an- 
gles, i.e. for a desired accuracy fewer interactions need to 
be evaluated. This advantage is partially compensated by 
the increased complexity per evaluation and the higher tree 
construction and tree storage overhead, such that the per- 
formance as a function of multipole order forms a broad 
maximum, where the precise location of the optimum may 
depend sensitively on fine details of the software implemen- 
tation of the tree algorithm. 

In GADGET-2, we deliberately went back to monopole 
moments, because they feature a number of distinct ad- 
vantages which make them very attractive compared to 
schemes that carry the expansions to higher order. First 
of all, gravitational oct-trees with monopole moments can 
be constructed in an extremely memory efficient way. In the 
first stage of our tree construction, particles are inserted one 
by one into the tree, with each internal node holding stor- 
age for indices of 8 daughter nodes or particles. Note that 
for leaves themselves, no node needs to be stored. In a sec- 
ond step, we compute the multipole moments recursively by 
walking the tree once in full. It is interesting to note that 
these 8 indices will not be needed anymore in the actual 
tree walk - all that is needed for each internal node is the 
information which node would be the next one to look at 
in case the node needs to be opened, or alternatively, which 
is the next node in line in case the multipole moment of 
the node can be used. We can hence reuse the memory used 
for the 8 indices and store in it the two indices needed for 
the tree walk, plus the multipole moment, which in the case 
of monopole moments is the mass and and the centre-of- 



mass coordinate vector. We additionally need the node side 
length, which adds up to 7 variables, leaving one variable 
still free, which we use however in our parallelisation strat- 
egy. In any case, this method of constructing the tree at no 
time requires more than ~ 0.65 x 8 x 4 ~ 21 bytes per par- 
ticle (assuming 4 bytes per variable), for a fully threaded 
tree. This compares favourably with memory consumptions 
quoted by other authors, even compared with the storage op- 
timised tree construction schemes of Dubinski et al. (2004) , 
where the tree is only constructed for part of the volume at 
a given time, or with the method of Wadsley et al. (2004), 
where particles are bunched into groups, reducing the num- 
ber of internal tree nodes by collapsing ends of trees into 
nodes. Note also that the memory consumption of our tree 
is lower than required for just storing the phase-space vari- 
ables of particles, leaving aside additional variables that are 
typically required to control time-stepping, or to label the 
particles with an identifying number. In the standard version 
of GADGET-2, we do not quite realize this optimum because 
we also store the geometric centre of each tree in order to 
simplify the SPH neighbour search. This can in principle be 
omitted for purely collisionless simulations. 

Very compact tree nodes as obtained above are also 
highly advantageous given the architecture of modern pro- 
cessors, which typically feature several layers of fast 'cache'- 
memory as work-space. Computations that involve data that 
is already in cache can be carried out with close to max- 
imum performance, but access to the comparatively slow 
main memory imposes large numbers of wait cycles. Small 
tree nodes thus help to make better use of the available 
memory bandwidth, which is often a primary factor limit- 
ing the sustained performance of a processor. By ordering 
the nodes in main memory in a special way (see Section 5.1), 
we can in addition help the processor and optimise its cache 
utilisation. 

Finally, a further important advantage of monopole mo- 
ments is that they allow simple dynamical tree updates 
that are consistent with the time integration scheme dis- 
cussed in detail in Section 4. GADGET-1 already allowed 
dynamic tree updates, but it neglected the time variation 
of the quadrupole moments. This introduced a time asym- 
metry, which had the potential to introduce secular inte- 
gration errors in certain situations. Note that particularly 
in simulations with radiative cooling, the dynamic range of 
timesteps can easily become so large that the tree construc- 
tion overhead would become dominant unless such dynamic 
tree update methods can be used. 

With respect to the cell-opening criterion, we usually 
employ a relative opening criterion similar to that used in 
GADGET-1, but adjusted to our use of monopole moments. 
Specifically, we consider a node of mass M and extension I 
at distance r for usage if 

— {-) ^ Q l a l> ( 18 ) 

where |a| is the size of the total acceleration obtained in the 
last timestep, and a is a tolerance parameter. This criterion 
tries to limit the absolute force error introduced in each 
particle-node interaction by comparing a rough estimate of 
the truncation error with the size of the total expected force. 
As a result, the typical relative force error is kept roughly 
constant, and if needed, the opening criterion adjusts to 
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Figure 1. Force-errors of the tree code for an isolated galaxy, 
consisting of a dark halo and a stellar disk. In the top panel, 
each line shows the fraction of particles with force errors larger 
than a given value. The different line-styles arc for different cell- 
opening criteria: the relative criterion is shown as solid lines and 
the standard BH criterion as dot-dashed lines. Both are shown for 
different values of the corresponding tolerance parameters, taken 
from the set {0.0005,0.001,0.0025,0.005,0.01,0.02} for a in the 
case of the relative criterion, and from {0.3, 0.4, 0.5, 0.6, 0.7, 0.8} 
in the case of the opening angle 6 used in the BH-criterion. In 
the lower panel, we compare the computational cost as a func- 
tion of force accuracy. Solid lines compare the force accuracy of 
the 99.9% percentile as a function of computational cost for the 
relative criterion (triangles) and the BH criterion (boxes). At the 
same computational cost, the relative criterion always delivers 
somewhat more accurate forces. The dotted lines show the cor- 
responding comparison for the 50% percentile of the force error 
distribution. 



the dynamical state of the simulation to achieve this goal; 
at high redshift, where peculiar accelerations largely cancel 
out, the average opening angles are very small, while they 
can become larger once matter clusters. Also, the opening 
angle varies with the distance of the node. The net result is 
an opening criterion that typically delivers higher force ac- 
curacy at a given computational cost compared to a purely 
geometrical criterion such as that of BH. In Figure 1, we 
demonstrate this explicitly with measurements of the force 
accuracy in a galaxy collision simulation. Note that for the 
first force computation, an estimate of the size of the force 
from the previous timestep is not yet available. We then use 
the ordinary BH opening criterion to obtain such estimates, 
followed by a recomputation of the forces in order to have 
consistent accuracy for all steps. 

Salmon & Warren (1994) pointed out that tree codes 
can produce rather large worst-case force errors when stan- 
dard opening criteria with commonly employed opening an- 
gles are used. These errors originate in situations where the 
distance to the nearest particle in the node becomes very 
small. When getting very close or within the node, the error 
can even become unbounded. Our relative opening crite- 
rion (18) may suffer such errors since one may in principle 
encounter a situation where a particle falls inside a node 
while still satisfying the cell-acceptance criterion. To protect 
against this possibility, we impose an additional opening cri- 
terion, viz. 



\r k -c k \ < 0.61. 



(19) 



Here c is the geometric centre of the node, r is the parti- 
cle coordinate, and the inequality applies for each coordinate 
axis k separately. We hence require that the particle lies out- 
side a box about 20% larger than the tree node. Tests have 
shown that this robustly protects against the occurrence of 
pathologically large force errors while incurring an accept- 
ably small increase in the average cost of the tree walk. 

3.1.2 Neighbour search using the tree 

We also use the BH oct-tree for the search of SPH neigh- 
bours, following the range-search method of Hernquist & 
Katz (1989). For a given spherical search region of radius 
hi around a target location n, we walk the tree with an 
opening criterion that examines whether there is any geo- 
metric overlap between the current tree node and the search 
region. If yes, the daughter nodes of the node are considered 
in turn. Otherwise the walk along this branch of the tree is 
immediately discarded. The tree walk is hence restricted to 
the region local to the target location, allowing an efficient 
retrieval of the desired neighbours. This use of the tree as a 
hierarchical search grid makes the method extremely flexible 
and insensitive in performance to particle clustering. 

A difficulty arises for the SPH force loop, where the 
neighbour search depends not only on hi, but also on prop- 
erties of the target particles. We here need to find all pairs 
with distances \n — rj\ < max(hi,hj), including those where 
the distance is smaller than hj but not smaller than hi. We 
solve this issue by storing in each tree node the maximum 
SPH smoothing length occuring among all particles repre- 
sented by the node. Note that we update these values consis- 
tently when the SPH smoothing lengths are redetermined in 
the first part of the SPH computation (i.e. the density loop). 
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Using this information, it is straightforward to modify the 
opening criterion such that all interacting pairs in the SPH 
force computation are always correctly found. 

Finally, a few notes on how we solve the implicit equa- 
tion (6) for determining the desired SPH smoothing lengths 
of each particle in the first place. For simplicity and for al- 
lowing a straightforward integration into our parallel com- 
munication strategy, we find the root of this equation with 
a binary bisection method. Convergence is significantly ac- 
celerated by choosing a Newton-Raphson value as the next 
guess instead of the the mid-point of the current interval. 
Given that we compute dpi/dhi anyway for our SPH formu- 
lation, this comes at no additional cost. Likewise, for each 
new timestep, we start the iteration with a new guess for hi 
based on the expected change from the velocity divergence 
of the flow. Because we usually only require that equation 
(6) is solved to a few per cent accuracy, finding and adjust- 
ing the SPH smoothing lengths is a subdominant task in the 
CPU time consumption of our SPH code. 



3.1.3 Periodic boundaries in the tree code 

The summation over the infinite grid of particle images re- 
quired for simulations with periodic boundary conditions 
can also be treated in the tree algorithm. GADGET-2 uses 
the technique proposed by Hernquist et al. (1991) for this 
purpose. The global BH-tree is only constructed for the pri- 
mary mass distribution, but it is walked such that each node 
is periodically mapped to the closest image as seen from the 
coordinate under consideration. This accounts for the dom- 
inant forces of the nearest images. For each of the partial 
forces, the Ewald summation method can be used to com- 
plement the force exerted by the nearest image with the 
contribution of all other images of the fiducial infinite grid 
of nodes. In practice, GADGET-2 uses a 3D lookup table (in 
one octant of the simulation box) for the Ewald correction, 
as proposed by Hernquist et al. (1991). 

In the first version of our code, we carried out the Ewald 
correction for each of the nodes visited in the primary tree 
walk over nearest node images, leading to roughly a dou- 
bling of the computational cost. However, the sizes of Ewald 
force correction terms have a very different distance depen- 
dence than the ordinary Newtonian forces of tree nodes. 
For nodes in the vicinity of a target particle, i.e. for sep- 
arations small against the boxsize, the correction forces are 
negligibly small, while for separations approaching half the 
boxsize they become large, eventually even cancelling the 
Newtonian force. In principle, therefore, the Ewald correc- 
tion only needs to be evaluated for distant nodes with the 
same opening criterion as the ordinary Newtonian force, 
while for nearby ones, a coarser opening angle can be chosen. 
In GADGET-2 we take advantage of this and carry out the 
Ewald corrections in a separate tree walk, taking the above 
considerations into account. This leads to a significant re- 
duction of the overhead incurred by the periodic boundaries. 



3.2 The TreePM method 

The new version of GADGET-2 used in this study option- 
ally allows the pure tree algorithm to be replaced by a hy- 
brid method consisting of a synthesis of the particle-mesh 
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Figure 2. Force decomposition and force error of the TreePM 
scheme. The top panel illustrates the size of the short-range (dot- 
dashed) and long-range force (solid) as a function of distance in 
a periodic box. The spatial scale r s of the split is marked with 
a vertical dashed line. The bottom panel compares the TreePM 
force with the exact force expected in a periodic box. For separa- 
tions of order the mesh scale (marked by a vertical dotted line), 
maximum force errors of 1 — 2 per cent due to the mesh anisotropy 
arise, but the rms force error is well below 1 per cent even in this 
range, and the mean force tracks the correct result accurately. If 
a larger force-split scale is chosen, the residual force anisotropics 
can be further reduced, if desired. 



method and the tree algorithm. GADGET-2's mathematical 
implementation of this so-called TreePM method (Xu, 1995; 
Bode et al., 2000; Bagla, 2002) is similar to that of Bagla 
& Ray (2003). The potential of Eqn. (3) is explicitly split 
in Fourier space into a long-range and a short-range part 
according to fc = (f> l ° ng + <^ hort , where 



/long 



= 4>k exp(-fc 2 r^), 



(20) 



with r s describing the spatial scale of the force-split. This 
long range potential can be computed very efficiently with 
mesh-based Fourier methods. Note that if r s is chosen 
slightly larger than the mesh scale, force anisotropies that 
exist in plain PM methods can be suppressed to essentially 
arbitrarily small levels. 

The short range part of the potential can be solved in 
real space by noting that for r s <§C L the short-range part of 
the real-space solution of Eqn. (3) is given by 
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(21) 



Here r» = min(Ja; — r» — nL\) is defined as the smallest dis- 
tance of any of the images of particle i to the point x. Be- 
cause the complementary error function rapidly suppresses 
the force for distances large compared to r s (the force drops 
to about 1% of its Newtonian value for r ~ 4.5 r s ), only this 
nearest image has any chance to contribute to the short- 
range force. 

The short-range force corresponding to Equation (21) 
can now be computed by the tree algorithm, except that 
the force law is modified by a short-range cut-off factor. 
However, the tree only needs to be walked in a small spatial 
region around each target particle, and no corrections for pe- 
riodic boundary conditions are required. Together this can 
result in a very substantial performance improvement, and 
in addition one typically gains accuracy in the long range 
force, which is now basically exact, and not an approxima- 
tion as in the tree method. Note that the TreePM approach 
maintains all of the most important advantages of the tree 
algorithm, namely its insensitivity to clustering, its essen- 
tially unlimited dynamic range, and its precise control about 
the softening scale of the gravitational force. 



3.2.1 Details of the TreePM algorithm 

To compute the PM part of the force, we use clouds- 
in-cells (CIC) assignment (Hockney & Eastwood, 1981) 
to construct the mass density field on to the mesh. We 
carry out a discrete Fourier transform of the mesh, and 
multiply it with the Greens function for the potential in 
periodic boundaries, —4ivG/k 2 , modified with the expo- 
nential truncation of the short-range force. We then de- 
convolve for the clouds-in-cell kernel by dividing twice 
with smc 2 (k x L/2N g )smc 2 (k y L/2N g )smc 2 (k z L/2N g ). One 
deconvolution corrects for the smoothing effect of the CIC 
in the mass assignment, the other for the force interpolation. 
After performing an inverse Fourier transform, we then ob- 
tain the gravitational potential on the mesh. 

We approximate the forces on the mesh by finite differ- 
encing the potential, using the four-point differencing rule 



dx 



ijk 



Ax ( 



1" 

1 

12 



i+2 ,j,k — ' 



(22) 



2,3, fe]) 



which offers order 0(Ax 4 ) accuracy, where Aa; = L/N mesil 
is the mesh spacing. It would also be possible to carry out 
the differentiation in Fourier space, by pulling down a factor 
—i k and obtaining the forces directly instead of the poten- 
tial. However, this would require an inverse Fourier trans- 
form separately for each coordinate, i.e. three instead of one, 
with little (if any) gain in accuracy compared to the four- 
point formula. 

Finally, we interpolate the forces to the particle posi- 
tion using again a CIC, for consistency. Note that the FFTs 
required here can be efficiently carried out using real-to- 
complex transforms and their inverse, which saves memory 
and execution time compared to fully complex transforms. 

In Figure 2, we illustrate the spatial decomposition of 
the force and show the force error of the PM scheme. This 



has been computed by randomly placing a particle of unit 
mass in a periodic box, and then measuring the forces ob- 
tained by the simulation code for a set of randomly placed 
test particles. We compare the force to the theoretically ex- 
pected exact force, which can be computed by Ewald sum- 
mation over all periodic images, and then by multiplying 
with the pre-factor 



ft = 1 - erfc 



(23) 



which takes out the short-range force, exactly the part that 
will be supplied by the short-range tree walk. The force er- 
rors of the PM force are mainly due to mesh anisotropy, 
which shows up on scales around the mesh size. However, 
thanks to the smoothing of the short-range force and the 
deconvolution of the CIC kernel, the mean force is very ac- 
curate, and the rms-force error due to mesh anisotropy is 
well below one percent. Note that these force errors compare 
favourably to the ones reported by P 3 M codes (Efstathiou 
et al., 1985, e.g.). Also, note that in the above formalism, 
the force anisotropy can be reduced further to essentially ar- 
bitrarily small levels by simply increasing r s , at the expense 
of slowing down the tree part of the algorithm. Finally we 
remark that while Fig 2 characterises the magnitude of PM 
force errors due to a single particle, it is not yet a realis- 
tic error distribution for a real mass distribution. Here the 
PM force errors on the mesh-scale can partially average out, 
while there can be additional force errors from the tree al- 
gorithm on short-scales. 



3.2.2 TreePM for 'zoom' simulations 

GADGET-2 is capable of applying the PM algorithm also 
for non-periodic boundary conditions. Here, a sufficiently 
large mesh needs to be used such that the mass distribution 
completely fits in one octant of the mesh. The potential 
can then be obtained by a real-space convolution with the 
usual 1/r kernel of the potential, and this convolution can 
be efficiently carried out using FFTs in Fourier space. For 
simplicity, GADGET obtains the necessary Green's function 
by simply Fourier transforming 1/r once, and storing the 
result in memory. 

However, it should be noted that the 4-point differ- 
encing of the potential requires that there are at least 2 
correct potential values on either side of the region that 
is actually occupied by particles. Since the CIC assign- 
ment/interpolation involves two cells, one therefore has the 
following requirement for the minimum dimension Af mcs h of 
the employed mesh: 



(JVmesh - 5) d > L, 



(24) 



where L is the spatial extension of the region occupied by 
particles and d is the size of a mesh cell. Recall that due 
to the necessary zero-padding, the actual dimension of the 
FFT that will be carried out is (2Af mcsh ) 3 . 

The code is also able to use a two-level hierarchy of FFT 
meshes. This was designed for 'zoom-simulations', which fo- 
cus on a small region embedded in a much larger cosmolog- 
ical volume. Some of these simulations can feature a rather 
large dynamic range, being as extreme as putting much more 
than 99% of the particles in less than 10 -10 of the volume 
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(Gao et al., 2005). Here, the standard TreePM algorithm is 
of little help since a mesh covering the full volume would 
have a mesh-size still so large that the high-resolution re- 
gion would fall into one or a few cells, so that the tree 
algorithm would effectively degenerate to an ordinary tree 
method within the high-resolution volume. 

One possibility to improve upon this situation is to 
use a second FFT-mesh that covers the high resolution re- 
gion, such that the long-range force is effectively computed 
in two steps. Adaptive mesh-placement in the AP 3 M code 
(Couchman et al., 1995) follows a similar scheme. GADGET- 
2 allows the use of such a secondary mesh-level and places 
it automatically, based on a specification of which of the 
particles are 'high-resolution particles'. However, there are 
a number of additional technical constraints in using this 
method. The intermediate-scale FFT works with vacuum 
boundaries, i.e. the code will use zero-padding and a FFT 
of size (2iV mea h) 3 to compute it. If Lhr is the maximum 
extension of the high-resolution zone (which may not over- 
lap with the boundaries of the box in case the base sim- 
ulation is periodic), then condition (24) for the minimum 
high-resolution cell size applies. But in addition, this inter- 
mediate scale FFT must properly account for the part of the 
short-range force that complements the long-range FFT of 
the whole box, i.e. it must be able to properly account for 
all mass in a sphere of size R cut around each of the high- 
resolution particles. There must hence be at least a padding 
region of size R cu t still covered by the mesh-octant used for 
the high-resolution zone. Because of the CIC assignment, 
this implies the constraint Lhr + 2i? cu t < dHR,(A' mes h — 1). 
This limits the dynamic range one can achieve with a single 
additional mesh- level. In fact, the high-resolution cell size 
must satisfy 



dim > max 



+ 2R C 



Lhr. 



N n 



(25) 



For our typical choice of R cut = 4.5 x r s — 1.25 x 4.5 x 
(1l,r, this means that the high-resolution mesh-size cannot 
be made smaller than dHR ^ 10 dLR/(^V mosh — 1), i.e. at 
least slightly more than 10 low-resolution mesh-cells must be 
covered by the high-resolution mesh. Nevertheless, provided 
one has a very large number of particles in a quite small 
high-resolution region, the resulting reduction of the tree 
walk time can outweight the additional cost of doing a large, 
zero-padded FFT for the high-resolution region. 

In Figure 3, we show the PM force error resulting for 
such a two-level decomposition of the PM force. We here 
placed a particle of unit mass randomly inside a high- 
resolution region of side- length 1/20 of a periodic box. We 
then measured the PM force accuracy of GADGET-2 by ran- 
domly placing test particles. Particles that were falling inside 
the high-resolution region were treated as high-resolution 
particles such that their PM-force consists of two FFT con- 
tributions, while particles outside the box receive only the 
long-range FFT force. In real simulations, the long-range 
forces are decomposed in an analogous way. With respect 
to the short-range force, the tree is walked with different 
values for the short-range cut-off, depending on whether a 
particle is characterised as belonging to the high-res zone or 
not. Note however that only one global tree is constructed 
containing all the mass. The top panel of Figure 3 shows the 
contributions of the different force components as a function 
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Figure 3. Force decomposition and force error of the TreePM 
scheme in the case when two meshes are used ('zoom- 
simulations'). The top panel illustrates the strength of the short- 
range (dot-dashed), intermediate-range (thick solid), and long- 
range force (solid) as a function of distance in a periodic box. The 
spatial scales of the two splits are marked with vertical dashed 
lines. The bottom panel shows the error distribution of the PM 
force. The outer matching region exhibits a very similar error 
characteristic as the inner match of tree- and PM-force. In both 
cases, for separations of order the fine or coarse mesh scale (dot- 
ted lines), respectively, force errors of up to 1 — 2 per cent arise, 
but the rms force error stays well below 1 per cent, and the mean 
force tracks the correct result accurately. 



of scale, while the bottom panel gives the distribution of the 
PM force errors. The largest errors occur at the matching 
regions of the forces. For realistic particle distributions, a 
large number of force components contribute, further reduc- 
ing the typical error due to averaging. 



4 TIME INTEGRATION 

4.1 The symplectic nature of the leapfrog 

Hamiltonian systems are not robust in the sense that they 
are not structurally stable against non-Hamiltonian pertur- 
bations. Numerical approximations for the temporal evolu- 
tion of a Hamiltonian system obtained from an ordinary 
numerical integration method (e.g. Runge-Kutta) in general 
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Figure 4. A Kepler problem of high eccentricity evolved with dif- 
ferent simple time integration schemes, using an equal timestep 
in all cases. Even though the leapfrog and the 2nd order Runge- 
Kutta produce comparable errors in a single step, the long term 
stability of the integration is very different. Even a computation- 
ally much more expensive 4th order Runge-Kutta scheme, with 
a smaller error per step, performs dramatically worse than the 
leapfrog in this problem. 



introduce non-Hamiltonian perturbations, which can com- 
pletely change the long-term behaviour. Unlike dissipative 
systems, Hamiltonian systems do not have attractors. 

The Hamiltonian structure of the system can be pre- 
served during the time integration if each step of it is formu- 
lated as a canonical transformation. Since canonical trans- 
formations leave the symplectic two-form invariant (equiv- 
alent to preserving Poincare's integral invariants, or stated 
differently, to preserving phase-space), such an integration 
scheme is called symplectic (Hairer et al., 2002, e.g.). Note 
that the time-evolution of a system can be viewed as a con- 
tinuous canonical transformation generated by the Hamilto- 
nian. If an integration step is the exact solution of a (partial) 
Hamiltonian, it represents the result of a phase-space con- 
serving canonical transformation and is hence symplectic. 

We now note that the Hamiltonian of the usual N-body 
problem is separable in the form 



H — //kin + H po 



(26) 



In this simple case, the time-evolution operators for each 
of the parts //kin and H pot can be computed exactly. This 
gives rise to the following drift and kick operators (Quinn 
et al., 1997): 



Dt(At) 



Kt(At) 



Pi 

'Ji' j 

P, 



rtii Jt 

Xi 

Pi + fi It 



t+Ai dt 



(27) 



(28) 



where f t = — y"V m^m., — is the force on particle i. 

Note that both D t and K t are symplectic operators 
because they are exact solutions for arbitrary At for the 
canonical transformations generated by the corresponding 
Hamiltonians. A time integration scheme can now be de- 
rived by the idea of operator splitting. For example, one can 
try to approximate the time evolution operator [/(At) for 
an interval At by 

U(At) = D^K(At)D^y (29) 
or 

U(At) = K^D(At)K^, (30) 

which correspond to the well-known drift-kick- drift (DKD) 
and kick- drift-kick (KDK) leapfrog integrators. Both of these 
integration schemes are symplectic, because they are a suc- 
cession of symplectic phase-space transformations. In fact, 
U generates the exact time evolution of a modified Hamil- 
tonian H. Using the Baker-Campbell-Hausdorff identity for 
expanding U and U, one can investigate the relation be- 
tween H and H . Writing H — H + H CII , one finds (Saha & 
Tremaine, 1992) 

Af 
12 

for the kick-drift-kick leapfrog. Provided H crr -C H, the evo- 
lution under H will be typically close to that under H. In 
particular, most of the Poincare integral invariants of H can 
be expected to be close to those of H, so that the long- 
term evolution of H will remain qualitatively similar to that 
of H. If H is time-invariant and conserves energy, then H 
will be conserved as well. For a periodic system, this will 



//err = |{//kin, //pot} , //kin + -//pot | + 0(At i 



(31) 
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then usually mean that the energy in the numerical solu- 
tion oscillates around the true energy, but there cannot be 
a long-term secular trend. 

We illustrate these surprising properties of the leapfrog 
in Figure 4. We show the numerical integration of a Ke- 
pler problem of high eccentricity e = 0.9, using second or- 
der accurate leapfrog and Runga-Kutta schemes with fixed 
timestep. There is no-long term drift in the orbital energy for 
the leapfrog result (top panel), only a small residual preces- 
sion of the elliptical orbit is observed. On the other hand, the 
Runge-Kutta integrator, which has formally the same error 
per step, catastrophically fails for an equally large timestep 
(middle panel). Already after 50 orbits the binding energy 
has increased by ~ 30%. If we instead employ a fourth or- 
der Runge-Kutta scheme using the same timestep (bottom 
panel), the integration is only marginally more stable, giv- 
ing now a decline of the binding energy by ~ 40% over 200 
orbits. Note however that such a higher order integration 
scheme requires several force computations per timestep, 
making it computationally much more expensive for a single 
step than the leapfrog, which requires only one force evalu- 
ation per step. The underlying mathematical reason for the 
remarkable stability of the leapfrog integrator observed here 
lies in its symplectic properties. 

4.2 Individual and adaptive timesteps 

In cosmological simulations, we are confronted with a large 
dynamic range in timescales. In high-density regions, like 
at the centres of galaxies, orders of magnitude smaller 
timesteps are required than in low-density regions of the 
intergalactic medium, where a large fraction of the mass 
resides. Evolving all particles with the smallest required 
timestep hence implies a substantial waste of computational 
resources. An integration scheme with individual timesteps 
tries to cope with this situation more efficiently. The prin- 
cipal idea is to compute forces only for a certain group of 
particles in a given kick operation, with the other particles 
being evolved on larger timesteps and being 'kicked' more 
rarely. 

Unfortunately, due to the pairwise coupling of parti- 
cles, a formally symplectic integration scheme with individ- 
ual timesteps is not possible, simply because the potential 
part of the Hamiltonian is not separable. However, one can 
partition the potential between two particles into a long- 
range and a short-range part, as we have done in the TreePM 
algorithm. This leads to a separation of the Hamiltonian into 

H — //kin + H Br + H\ T . (32) 

We can now easily obtain symplectic integrators as a gener- 
alisation of the ordinary leapfrog schemes by "subcycling" 
the evolution under -Hkm + H SI (Duncan et al., 1998). For 
example, we can consider 

U(At) = (33) 

M¥) [M&W£)M&)] m M¥) 

with m being a positive integer. This is the scheme 
GADGET-2 uses for integrating simulations run with the 
TreePM algorithm. The long-range PM force has a com- 
paratively large timestep, which is sufficient for the slow 
time-variation of this force. Also, we always evaluate this 
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Figure 5. A Kepler problem of high eccentricity integrated with 
leapfrog schemes using a variable timestep from step to step, 
based on the At or. 1/ y^\a\ criterion commonly employed in cos- 
mological simulations. As a result of the variable timesteps, the 
integration is no longer manifestly time reversible, and long term 
secular errors develop. Interestingly, the error in the KDK variant 
grows four times slower than in the DKD variant, despite being 
of equal computational cost. 



force for all particles. The evolution under the short-range 
force, however, which varies on shorter timescales, is done on 
a power of 2 subdivided timescale. Here, we optionally also 
allow particles to have individual timesteps, even though 
this perturbs the symplectic nature of the integration (see 
below). Note that unlike the PM algorithm, tree forces can 
be easily computed for a small fraction of the particles, at 
a computational cost that is to first order strictly propor- 
tional to the number of particles considered. This is true as 
long as the subfraction is not so small that tree construction 
overhead becomes significant. PM-forces on the other hand 
are either "all" or "nothing". The above decomposition is 
hence ideally adjusted to these properties. 

Note that despite the somewhat complicated appear- 
ance of equation (33), the integration scheme is still a simple 
alternation of drift and kick operators. In practice, the sim- 
ulation code simply needs to drift the whole particle system 
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to the next synchronisation point where a force computation 
is necessary. There, a fraction of the particles receive a force 
computation and their momenta are updated accordingly, 
as illustrated in the sketch of Figure 6. Then the system is 
drifted to the next synchronisation point. 

As we have discussed, the integration is no longer 
symplectic in a formal sense when individual short-range 
timesteps are chosen for different particles. However, in the 
limit of collisionless dynamics, we can argue that the par- 
ticle number is so large that particles effectively move in a 
collective potential, where we assume that any force between 
two particles is always much smaller than the total force. In 
this desired limit, two-body collisions become unimportant, 
and the motion of particles is to good approximation colli- 
sionless. We can then approximate the particles as moving 
quasi-independently in their collective potential, which we 
may describe by a global potential $(cc, i). Obviously, in this 
approximation the Hamiltonian separates into a sum of sin- 
gle particle Hamiltonians, where we have now hidden their 
coupling in the collective potential 3>(x, t). Provided we fol- 
low the evolution of each particle accurately in this fiducial 
collective potential <&(x, t), the evolution of the potential it- 
self will also be faithful, justifying the integration of particles 
with individual timesteps in a N-body system that behaves 
collisionlessly. While not formally being symplectic, the evo- 
lution can then be expected to reach comparable accuracy 
to a phase-space conserving symplectic integration. 

Treating the potential as constant for the duration of a 
kick, each particle can be integrated by a sequence of KDK 
leapfrogs, which may have different timestep from step to 
step. Note that changing the timestep in the leapfrog from 
step to step does not destroy the simplecticity of the inte- 
grator, because the implied transformation is constructed 
from steps which are simplectic individually. However, what 
one finds in practice is that the superior long-term sta- 
bility of periodic motion is typically lost. This is because 
each time the timestep is changed, the error Hamiltonian 
appearing in equation (31) is modified. This introduces an 
artificial temporal dependence into the numerical Hamilto- 
nian which is not in phase with the orbit itself because the 
timestep criterion usually involves information from the pre- 
vious timestep. The associated time-asymmetry destroys the 
formal time reversibility of the integration, and the phase- 
lag of the timestep cycle in each orbit produces a secular 
evolution. We illustrate this behaviour in Figure 5 for an in- 
tegration of the Kepler problem considered earlier, but this 
time using a leapfrog with an adaptive timestep according 
to At oc where a is the acceleration of the last 

timestep. Interestingly, while being equivalent for a fixed 
timestep, the DKD and KDK leapfrogs behave quite differ- 
ently in this test. For the same computational effort, the 
energy error grows four times as fast in the DKD scheme 
compared with the KDK scheme. This is simply because the 
effective time asymmetry in the DKD scheme is effectively 
twice as large. To see this, consider what determines the size 
of a given timestep when integrating forward or backwards 
in time. In the DKD scheme, the relevant acceleration that 
enters the timestep criterion stems from a moment that lies 
half a timestep before or behind the given step. As a result, 
there is temporal lapse of two timesteps between forward 
and backwards integration. For the KDK, the same consid- 
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Figure 6. Schematic illustration of the short- and long-range 
timestepping used by GADGET-2. The code always drifts the 
whole particle system to the next time when a force computation 
is required. At that time 'kicks', i.e. changes of the particle mo- 
menta, are applied based on short-range or long-range forces, or 
on both. 



eration leads only to a temporal asymmetry of one timestep, 
half as large. 

The KDK scheme is hence clearly superior once one al- 
lows for individual timesteps. It is also possible to try to 
recover time reversibility more precisely. Hut et al. (1995) 
discuss an implicit timestep criterion that depends both on 
the beginning and end of the timestep, and similarly, Quinn 
et al. (1997) discuss a binary hierarchy of trial steps that 
serves a similar purpose. However, these schemes are com- 
putationally impractical for large collisionless systems. But 
fortunately, here the danger to build up large errors by sys- 
tematic accumulation over many periodic orbits is much 
smaller, because the gravitational potential is highly time- 
dependent and the particles tend to make comparatively few 
orbits over a Hubble time. 

In the normal integration mode of GADGET-2, we dis- 
cretise the timesteps in a power of 2 hierarchy, where all 
timesteps are a power of 2 subdivision of a global timestep. 
Particles may always move to a smaller timestep, but to a 
larger one only every second step, when this leads to syn- 
chronisation with the higher timestep hierarchy. The level of 
synchronisation achieved by this is beneficial for minimising 
the required number of particle drifts and tree constructions. 
Alternatively, the code also allows a more flexible way to 
populate timesteps, where timesteps are discretised as inte- 
ger multiples of the minimum timestep occuring among the 
particle set. This has the advantage of producing a more ho- 
mogenous distribution of particles across the timeline, which 
can simplify work-load balancing. 



4.3 Time integration scheme of SPH particles 

For gas particles, similar considerations apply in principle, 
since in the absence of viscosity, SPH can also be formu- 
lated as a Hamiltonian system. However, because shocks oc- 
cur in any non-trivial flow, hydrodynamics will in practice 
always be irreversible, hence the long-term integration as- 
pects of Hamiltonian systems do not apply as prominently 
here. Also, in systems in hydrodynamic equilibrium the gas 
particles do not move, and hence do not tend to accumulate 
errors over many orbits as in dynamical equilibrium. How- 
ever, if SPH particles are cold and rotate in a centrifugally 
supported disk, long-term integration aspects can become 
important again. So it is desirable to treat the kinematics 
of SPH particles in close analogy to that of the collisionless 
particles. 

The reversible part of hydrodynamics can be described 
by adding the thermal energy to the Hamiltonian, viz. 
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#thcrm — - 1 - ^ mjAipJ \ (34) 

Note that the SPH smoothing lengths are implicitly given by 
equations (6), i.e. the thermal energy depends only on the 
entropy per unit mass, and the particle coordinates. Hence 
the same considerations as for the collisionless leapfrog ap- 
ply, and as long as there is no entropy production included, 
time integration is fully time reversible. This is actually a 
considerable difference to mesh codes which in non-trivial 
flows always produce some entropy due to mixing, even when 
the fluid motion should in principle be fully adiabatic. These 
errors arise from the advection over the mesh, and are absent 
in the above formulation of SPH. 



5 PARALLELISATION STRATEGIES 

There are a number of different design philosophies for 
constructing powerful supercomputers. So-called sector- 
machines employ particularly potent CPUs which can si- 
multaneously carry out computational operations on whole 
arrays of floating point numbers. However, not all algorithms 
can easily exploit the full capabilities of such vector proces- 
sors, ft is easier to use scalar architectures, but here large 
computational throughput is only achieved by the simulta- 
neous use of a large number of processors. The goal is to 
let these CPUs work together on the same problem, thereby 
reducing the time to solution and allowing larger problem 
sizes. Unfortunately, the required parallelisation of the ap- 
plication program is not an easy task in general. 

On symmetric multiprocessing (SMP) computers, sev- 
eral scalar CPUs share the same main memory, so that time- 
intensive loops of a computation can be distributed easily for 
parallel execution on several CPUs using a technique called 
threading. The code for creation and destruction of threads 
can be generated automatically by sophisticated modern 
compilers, guided by hints inserted into the code in the 
form of compiler directives (e.g. based on the OpenMP stan- 
dard). The primary advantage of this method lies in its ease 
of use, requiring few (if any) algorithmic changes in exist- 
ing serial code. A disadvantage is that the compiler-assisted 
parallelisation may not always produce an optimum result, 
and depending on the code, sizable serial parts may remain. 
A more serious limitation is that this technique prevents 
one from using processor numbers and memory larger than 
available on a particular SMP computer. Also, such shared- 
memory SMP computers tend to be substantially more ex- 
pensive than a set of single computers with comparable per- 
formance, with the price-tag quickly rising the more CPUs 
are contained within one SMP computer. 

A more radical approach to parallelisation is to treat 
different scalar CPUs as independent computers, each of 
them having their own separate physical memory, and each 
of them running a separate instance of the application code. 
This approach requires extension of the program with in- 
structions that explicitly deal with the necessary commu- 
nication between the CPUs to split up the computational 
work and to exchange partial results. Memory is distributed 
in this method. In order to allow a scaling of the problem 
size with the total available memory, each CPU should only 
store a fraction of the total data of the problem in its own 



memory. Successful implementation of this paradigm there- 
fore requires substantial algorithmic changes compared to 
serial programs, and depending on the problem, a consid- 
erably higher complexity than in corresponding serial codes 
may result. However, such massively parallel programs have 
the potential to be scalable up to very large processor num- 
ber, and to exploit the combined performance of the CPUs 
in a close to optimum fashion. Also, such codes can be run 
on computers of comparatively low cost, like clusters of or- 
dinary PCs. 

GADGET-2 follows this paradigm of a massively par- 
allel simulation code. It contains instructions for commu- 
nication using the standardised 'Message Passing Interface' 
(MPI). The code itself was deliberately written using the 
language C (following the ANSI-standard) and the open- 
source libraries GSL and FFTW. This results in a very high 
degree of portability to the full family of UNIX systems, 
without any reliance on special features of proprietary com- 
pilers. The parallelisation algorithms of the code are flexible 
enough to allow its use on an arbitrary number of proces- 
sors, including just one. As a result GADGET-2 can be run on 
large variety of machines, ranging from a laptop to clusters 
of the most powerful SMP computers presently available. In 
the following, we describe in more detail the parallelisation 
algorithms employed by the code. 

5.1 Domain decomposition and Peano-Hilbert 
order 

Since large cosmological simulations are often memory- 
bound, it is essential to decompose the full problem into 
parts that are suitable for distribution to individual pro- 
cessors. A commonly taken approach in the gravitational 
N-body/SPH problem is to decompose the computational 
volume into a set of domains, each assigned to one proces- 
sor. This has often been realised with a hierarchical orthog- 
onal bisection, with cuts chosen to approximately balance 
the estimated work for each domain (e.g. Dubinski, 1996). 
However, a disadvantage of some existing implementations 
of this method is that the geometry of the tree eventually 
constructed for each domain depends on the geometry of 
the domains themselves. Because the tree force is only an 
approximation, this implies that individual particles may ex- 
perience a different force error when the number of CPUs is 
changed, simply because this in general modifies the way the 
underlying domains are cut. Of course, provided the typical 
size of force errors is sufficiently small, this should not pose 
a severe problem for the final results of collisionless sim- 
ulations. However, it complicates code validation, because 
individual particle orbits will then depend on the number of 
processors employed. Also, there is the possibility of subtle 
correlations of force errors with domain boundaries, which 
could especially in the very high redshift regime show up as 
systematic effects. 

Here we propose a new scheme for domain decompo- 
sition that guarantees a force that is independent of the 
processor number. It also avoids other shortcomings of the 
orthogonal bisection, such as high aspect-ratios of domains. 
Our method uses a space-filling fractal, the Peano-Hilbert 
curve, to map 3D space onto a ID curve. The latter is then 
simply chopped off into pieces that define the individual do- 
mains. The idea of using a space-filling curve for the domain 
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Figure 7. Space-filling Peano-Hilbert curve in two (bottom) and three (top) dimensions. 



decomposition of a tree code has first been proposed by War- 
ren & Salmon (1993); Warren & Salmon (1995). They how- 
ever used Morton ordering for the underlying curve, which 
produces irregularly shaped domains. 

In Figure 7, we show examples of the Peano-Hilbert 
curve in two and three dimensions. The Peano-curve in 2D 
can be constructed recursively from its basic 'U'-shaped 
form that fills a 2 x 2 grid, together with the rules that 
determine the extension of this curve onto a 4 x 4 grid. As 
can be seen in Fig 7, these rules mean that the bar of the 
"U" has to be replaced with two smaller copies of the un- 
derlying "U" , while at the two ends, rotated and mirrored 
copies have to be placed. By repeated application of these 
rules one can construct an area-filling curve for arbitrarily 
large grids of size 2™ x 2™. In three dimensions, a basic curve 
defined on a 2 x 2 x 2 grid can be extended in an analogous 
way, albeit with somewhat more complicated mapping rules, 
to the three dimensional space-filling curve shown in Fig. 7. 

An interesting property of these space-filling curves is 
their self-similarity. Suppose we describe the Peano-Hilbert 
curve that fills a 2" x 2" x 2" grid with a one-to-one mapping 
p n (i,j, k), where the value p n £ [0, . . . , n 3 — 1] of the func- 
tion is the position of the cell k) along the curve. Then 
we have p n / 2 {i/2, j/2, k/2) = p n (i,j,k)/8, where all divi- 
sions are to be understood as integer divisions. We can hence 
easily "contract" a given Peano-Hilbert curve and again ob- 
tain one of lower order. This is a property we exploit in the 
code. 

A second important property is that points that are 
close along the one-dimensional Peano-Hilbert curve are in 
general also close in three dimensional space, i.e. the map- 
ping preserves locality. If we simply cut a space-filling Peano- 
curve into segments of a certain length, we obtain a domain 
decomposition which has the property that the spatial do- 
mains are simply connected and quite 'compact', i.e. they 
tend to have small surface-to-volume ratios and low aspect 



ratios, a highly desirable property for reducing communica- 
tion costs with neighbouring domains. 

Third, we note that there is a close correspondence be- 
tween the spatial decomposition obtained by a hierarchical 
BH oct-tree, and that obtained from segmenting a Peano- 
Hilbert curve. For example, consider a fiducial Peano-Hilbert 
curve that fills a box (the root node), encompassing the 
whole particle set. Cutting this curve into 8 equally long 
pieces, and then recursively cutting each segment into 8 
pieces again, we regenerate the spatial oct-tree structure of 
the corresponding BH tree. If we hence assign an arbitrary 
segment of the Peano-Hilbert curve to a processor, the cor- 
responding volume is compatible with the node structure of 
a fiducial global BH tree covering the full volume, i.e. we 
effectively assign a collection of branches of this tree to each 
processor. Because of this property, we obtain a tree whose 
geometry is not affected by the parallelisation method, and 
the results for the tree force become strictly independent of 
the number of processors used. 

We illustrate these concepts in Figure 8, where we show 
a sketch of a global BH tree and its decomposition into do- 
mains by a Peano-Hilbert curve. For simplicity, we show the 
situation in 2D. Note that the sizes of the largest nodes as- 
signed to each processor in this way need not all be of the 
same size. Instead, the method can quite flexible adjust to 
highly clustered particle distributions, if required. 

In order to carry out the domain decomposition in prac- 
tice, we first compute a Peano-Hilbert key for each particle. 
This is simply the integer returned by the function p, where 
the coordinates of particles are mapped onto integers in the 
range [0, 2" — 1]. The construction of the Peano-Hilbert key 
can be carried out with a number of fast bit-shift operations, 
and short look-up tables that deal with the different orien- 
tations of the fundamental figure. We typically use n = 20, 
such that the key fits into a 64-bit long integer, giving a dy- 
namic range of the Peano-Hilbert curve of ~ 10 6 per dimen- 
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Figure 8. Illustration of the relation between the BH oct-tree and a domain decomposition based on a Peano-Hilbert curve. For clarity, 
the sketch is drawn in two dimensions. The fiducial Peano curve associated with the simulation volume visits each cell of a regular mesh 
exactly once. The simulation volume is cut into domains by segmenting this curve at arbitrary intermediate points on cell boundaries. This 
generates a rule for distributing the particle set onto individual processors. Since the geometric structure of the BH tree is commensurable 
with the mesh, each mesh-cell corresponds to a certain branch of a fiducial global BH tree. These branches then reside entirely on single 
processors. In addition, each processor constructs a 'top-level tree' where all nodes at higher level are represented. The missing data on 
other processors is marked using 'pseudo-particles' in this tree. 



sion. This is enough for all present applications but could 
be easily extended if needed. 

In principle, one would then like to sort these keys and 
divide the sorted list into segments of approximately con- 
stant work-load. However, since the particle data (includ- 
ing the keys) is distributed, a global sort is a non-trivial 
operation. We solve this problem using an adaptive hash- 
ing method. Each processor first considers only its locally 
sorted list of keys and uses it to recursively construct a set 
of segments (by chopping segments into 8 pieces of equal 
length) until each holds at most sN/N cpu particles, where 
we usually take s ~ 0.1. This operation partitions the load 
on each processor into a set of reasonably fine pieces, but 
the total number of these segments remains small, indepen- 
dent of the clustering state of matter. Next, a global list of 
all these segments is established, and segments that overlap 
are joined and split as needed, so that a global list of seg- 
ments results. It corresponds to a BH tree where the leaf 
nodes hold of order sN/N cpu particles. We can now assign 
one or several consecutive segments to each processor, with 
the divisions chosen such that an approximate work-load 
balance is obtained, subject to the constraint of a maximum 
allowed memory imbalance. The net result of this procedure 
is that a range of keys is assigned to each processor, which 
defines the domain decomposition and is now used to move 
the particles to their target processors, as needed. Note that 



unlike a global sort, the above method requires little com- 
munication. 

For the particles of each individual processor, we then 
construct a BH-tree in the usual fashion, using the full ex- 
tent of the particle set as the root grid size. In addition, 
we insert "pseudo-particles" into the tree, which represent 
the mass on all other processors. Each of the segments in 
the global domain-list which was not assigned to the local 
processor is represented by a pseudo-particle. In the tree, 
these serve as placeholders for branches of the tree that re- 
side completely on a different processor. We can inquire the 
multipole moments of such a branch from the corresponding 
remote processor, and give the pseudo particle these proper- 
ties. Having inserted the pseudo-particles into each local tree 
therefore results in a "top-level tree" that complements the 
tree branches generated by local particles. The local tree is 
complete in the sense that all internal nodes of the top-level 
tree have correct multipole-moments, and they are indepen- 
dent of the domain decomposition resulting for a given pro- 
cessor number. However, the local tree has some nodes that 
consist of pseudo-particles. These nodes cannot be opened 
since the corresponding particle data resides on a different 
processor, but when encountered in the tree walk, we know 
precisely on which processor this information resides. 

The parallel tree-force computation proceeds therefore 
as follows. For each of its (active) local particles, a processor 
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walks its tree in the usual way, collecting force contributions 
from a set of nodes, which may include top-level tree nodes 
and pseudo particles. If the node represented by a pseudo- 
particle needs to opened, the walk along the corresponding 
branch of the tree cannot be continued. In this case, the 
particle is nagged for export to the processor the pseudo- 
particle came from, and its coordinates are written into a 
buffer-list, after which the tree walk is continued. If needed, 
the particle can be put several times into the buffer list, 
but at most once for each target processor. After all local 
particles have been processed, the particles in the buffer are 
sorted by the rank of the processor they need to be sent to. 
This collects all the data that needs to be sent to a certain 
processor in a contiguous block, which can then be com- 
municated in one operation based on a collective hypercube 
communication model. The result is a list of imported parti- 
cles for which the local tree is walked yet again. Unlike in the 
normal tree walk for local particles, all branches of the tree 
that do not exclusively contain local mass can be immedi- 
ately discarded, since the corresponding force contributions 
have already been accounted for by the processor that sent 
the particle. Once the partial forces for all imported particles 
have been computed, the results are communicated back to 
the sending processors, using a second hypercube commu- 
nication. A processor that sent out particles receives in this 
way force contributions for nodes that it could not open 
locally. Adding these contributions to the local force com- 
puted in the first step, the full force for each local particle 
is then obtained. The forces are independent of the number 
of processors used and the domain cuts that where made. In 
practice, numerical round-off can still introduce differences 
however, since the sequence of arithmetic operations that 
leads to a given force changes when the number of CPUs is 
modified. 

Unlike in GADGET- 1, particles are not automatically 
exported to other processors, and if they are, then only 
to those processors that hold information that is directly 
needed in the tree walk. Particularly in the TreePM scheme 
and in SPH, this leads to a drastic reduction in the required 
communication during the parallel force computations, an 
effect that is particularly important when the number of 
CPUs is large. Since the domains are locally compact and 
the tree-walk is restricted to a small short-range region in 
SPH and TreePM, most particles will lie completely inside 
the local domain, requiring no information from other pro- 
cessors at all. And if they have to be exported, then typically 
only to one or a few other processors. We also remark that 
the above communication scheme tends to hide communica- 
tion latency, because the processors can work independently 
on (long) lists of particles before they meet for an exchange 
of particles or results. 

Finally, we note that we apply the Peano-Hilbert curve 
for a second purpose as well. Within each local domain, we 
order the particles in memory according to a finely resolved 
Peano-Hilbert curve. This is done as a pure optimisation 
measure, designed to increase the computational speed. Be- 
cause particles that are adjacent in memory after Peano- 
Hilbert ordering will have close spatial coordinates, they 
also tend to have very similar interaction lists. If the mi- 
croprocessor works on them consecutively, it will hence in 
many cases find the required data for tree-nodes already in 
local cache-memory, which reduces wait cycles for the slower 



main memory. Our test results show that the Peano-Hilbert 
ordered particle set can result in nearly twice the perfor- 
mance compared to random order, even though the actual 
tree code that is executed is the same in both cases. The 
exact speed-up obtained by this trick is architecture- and 
problem-dependent, however. 

5.2 Parallel Fourier transforms 

In the TreePM algorithm, we not only need to parallelise 
the tree algorithm, but also the PM computations. For the 
Fourier transforms themselves we employ the massively par- 
allel version of the FFTW library developed at MIT. The 
decomposition of the data is here based on slabs along one 
coordinate axis. The Fourier transform can then be carried 
out locally for the coordinate axes parallel to the slabs, but 
the third dimension requires a global transpose of the data 
cube, a very communication intensive step which tends to 
be quite restrictive for the scalability of massively parallel 
FFTs, unless the communication bandwidth of the computer 
is very high. Fortunately, in most applications of interest, the 
cost of the FFTs is so subdominant that even a poor scal- 
ing remains unproblematic up to relatively large processor 
numbers. 

A more important problem lies in the slab data layout 
required by the FFT, which is quite different from the, to 
first order, 'cubical' domain decomposition that is ideal for 
the tree algorithm. Dubinski et al. (2004) and White (2002) 
approached this problem by choosing a slab decomposition 
also for the tree algorithm. While being simple, this poses 
severe restrictions on the combinations of mesh size and pro- 
cessor number that can be run efficiently. In particular, in 
the limit of large processor number, the slabs become very 
thin, so that work-load balancing can become poor. In ad- 
dition, due to the large surface to volume ratio of the thin 
slabs, the memory cost of ghost layers required for the CIC 
assignment and interpolation schemes can become quite siz- 
able. In fact, in the extreme case of slabs that are one mesh 
cell wide, one would have to store three ghost layer zones, 
which would then have to come from more than one proces- 
sor on the 'left' and 'right'. 

An obvious alternative is to use different decomposi- 
tions for the tree algorithm and the PM part. This is the 
approach GADGET-2 uses. One possibility would be to swap 
the data between the Peano-Hilbert decomposition, and the 
slab decomposition whenever a PM force computation is 
necessary. However, this approach has a number of draw- 
backs. First of all, it would require the exchange of a sub- 
stantial data volume, because almost all particles and their 
associated data would have to be moved in the general 
case. Second, since the slab decomposition essentially en- 
forces an equal volume decomposition, this may give rise 
to large particle-load imbalance in highly clustered simula- 
tions, for example in "zoom" simulations. An extreme case 
of this problem would be encountered when FFTs with vac- 
uum boundaries are used. Here at least half of the slabs, 
and hence processors, would be completely devoid of par- 
ticles if the particle set was actually swapped to the slab 
decomposition. 

We therefore implemented a second possibility, where 
the particle data remains in place, i.e. in the order estab- 
lished for the tree algorithm. For the FFT, each processor 
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determines by itself with which slab its local particle data 
overlaps. For the corresponding patch, the local particle data 
is then CIC-binned, and this patch is transmitted to the pro- 
cessor that holds the slab in the parallel FFT. In this way, 
the required density field for each slab is constructed from 
the contributions of several processors. In this scheme only 
the scalar density values are transmitted, which is a substan- 
tially smaller data volume than in the alternative scheme, 
even when the PM grid is chosen somewhat larger than the 
effective particle grid. After the gravitational potential has 
been computed, we collect in the same way the potential for 
a mesh that covers the local particle set. We can here pull the 
corresponding parts from the slabs of individual processors, 
including the ghost layers required around the local patch for 
finite differencing of the potential. Since the local domains 
are compact, they have a much smaller surface-to- volume 
ratio than the slabs, so that the memory cost of the ghost 
layers remains quite small. After the local patch of the po- 
tential has been assembled, it can be finite differenced and 
interpolated to the particle coordinates without requiring 
any additional communication. This method hence inlines 
the PM computation in a quite flexible way with the tree 
algorithm, without putting any restriction on the allowed 
processor number, and avoiding, in particular, the memory- 
and work-load balancing issues mentioned above. 

5.3 Parallel I/O 

Current cosmological simulations have reached a substantial 
size, with particle numbers well in excess of 10 7 done quite 
routinely. Time slices of such simulations can reach up to 
a few GByte in size, at which point it becomes very time- 
consuming to write or read this data sequentially on a single 
processor. Also, it can be impractical to store the data in a 
single file. GADGET-2 therefore allows simulation data to be 
split across several files. Each file is written or read by one 
processor only, with data sent to or received by a group of 
processors. Several of these file can be processed in parallel. 
This number can be either equal to the total number of files 
requested, or restricted to a smaller value in order to prevent 
a "flooding" of the I/O subsystem of the operating system, 
which can be counterproductive. Unlike in previous versions 
of the code, GADGET-2 does not pose restrictions on the 
number of files and the number of simultaneously processed 
files in relation to the number of processors that is used. 

In the largest simulation carried out with GADGET-2 
thus far, a simulation with 2160 3 particles (Springel et al., 
2005), the total size of a snapshot slice was more than 300 
GB. Using parallel I/O on the high-performance IBM p690 
system of the MPG computing centre in Garching, these 
time slices could be written in slightly less than 300 seconds, 
translating in an effective disk bandwidth of ~ 1 GB/sec. 
Without parallel I/O, this would have taken a factor ~ 50 — 
60 longer. 

5.4 Miscellaneous features 

We note that unlike previous versions, GADGET-2 can be 
run on an arbitrary number of processors, including a sin- 
gle processor. There is hence no need any more for separate 
serial and parallel versions. Lifting the restriction for the 



processor numbers to be powers of two can be quite use- 
ful, particularly for loosely coupled clusters of workstations, 
where windows of opportunity for simulations may arise that 
offer 'odd' processor numbers for production runs. 

This flexibility is achieved despite the code's use of a 
communication model that operates with synchronous com- 
munication exclusively. The principal model for communica- 
tion in the force computations follows a hypercube strategy. 
If the processor number is a power of two, say 2 P , then a 
full all-to-all communication cycle can be realized by 2 P — 1 
cycles, where in each cycle 2 P_1 disjoint processor pairs arc 
formed that exchange messages. If the processor number is 
not a power of two, this scheme can still be used, but the 
processors need to be embedded in the hypercube scheme 
corresponding to the next higher power of two. As a re- 
sult, some of the processors will be unpaired in a subfraction 
of the communication cycle, lowering the overall efficiency 
somewhat. 

GADGET-2 can also be used to set-up 'glass' initial con- 
ditions, as suggested by White (1996). Such a particle dis- 
tribution arises when a Poisson sample in an expanding pe- 
riodic box is evolved with the sign of gravity reversed until 
residual forces have dropped to negligible values. The glass 
distribution then provides an alternative to a regular grid 
for use as an unperturbed initial mass distribution in cos- 
mological simulations of structure formation. To speed up 
convergence, the code uses an "inverse Zel'dovich" approxi- 
mation based on the measured forces to move the particles 
to their estimated Lagrangian positions. 

We have also added the ability to simulte gas-dynamical 
simulations in two dimensions, both with and without 
periodic boundary conditions. A further new feature in 
GADGET-2 is the optional use of the Hierarchical Data For- 
mat (HDF5), developed by NCSA. This allows storage of 
snapshot files produced by GADGET-2 in a platform inde- 
pendent form, simplifying data exchange with a variety of 
analysis software. 



6 TEST PROBLEMS 

Unfortunately, it is not possible to formally demonstrate the 
correctness of complex simulation codes such as GADGET-2. 
However, the reliability of a code can be studied empirically 
by applying it to a wide range of problems, under a broad 
range of values of nuisance code parameters. By comparing 
with known analytic solutions and other independent nu- 
merical methods, an assessment of the numerical reliability 
of the method can be established, which is essential for trust- 
ing the results of simulations where no analytic solutions are 
known (which is of course the reason to perform simulations 
to begin with). 

We begin with a simple shock-tube test for the SPH 
component of GADGET-2, which has known analytic solu- 
tions. We then consider the more elaborate problem of the 
collapse of a cold sphere of gas under self-gravity. This three- 
dimensional problem couples self-gravity and gas dynamics 
over a dynamic range similar to that encountered in struc- 
ture formation simulations. There are no analytic solutions, 
but highly accurate results from ID shock-capturing codes 
exist for comparison. We then move on and consider the 
highly dissipative collapse of an isothermal cloud of gas, 
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ulation of the formation of a rich cluster of galaxies. The cor- 
rect solution for this complex problem, which is directly tied 
to our theoretical understanding of the intracluster medium, 
is not known. However, results for GADGET-2 can be com- 
pared to the 12 codes examined in Frenk et al. (1999), which 
can serve as a broad consistency check. 

Finally, we briefly consider a further hydrodynamical 
test problem, which involves strong shocks and vorticity gen- 
eration. This is the interaction of a blast wave with a cold 
cloud of gas embedded at pressure equilibrium in ambient 
gas. This forms an advanced test of the capabilities of the 
SPH solver and has physical relevance for models of the in- 
terstellar medium, for example. 



6.1 Shock tube 

We begin by considering a standard Sod shock-tube test, 
which provides a useful validation of the code's ability to 
follow basic hydrodynamical phenomena. We consider an 
ideal gas with 7 = 1.4, initially at rest, where the half-space 
x < is filled with gas at unit pressure and unit density 
(pi = 1, Pi = 1), while x > is filled with high pressure 
gas (P 2 = 0.1795) of lower density (p 2 = 0.25). These ini- 
tial conditions have been frequently used as a test for SPH 
codes (e.g. Hernquist & Katz, 1989; Rasio & Shapiro, 1991; 
Wadsley et al., 2004). We realize the initial conditions in 
3D using an irregular glass-like distribution of particles of 
equal mass, embedded in a periodic box that is longer in the 
x-direction than in the y- and z-directions. 

In Figure 9, we show the result obtained with GADGET- 
2 at time t = 5.0. The agreement with the analytic solution is 
good, with discontinuities resolved in about ~ 3 interparticle 
separations, or equivalently 2 — 3 SPH smoothing lengths. 
At the contact discontinuity, a characteristic pressure blip 
is observed, and some excess entropy has been produced 
there as a result of the sharp discontinuity in the initial 
conditions, which has not been smoothed out and therefore 
is not represented well by SPH at t = 0. Note that while the 
shock is broadened, the post-shock temperature and density 
are computed very accurately. 



Figure 9. Sod shock test carried out in three dimensions. The 
gas is initially at rest with pi = 1.0, Pi = 1.0 for x < 0, and 
P2 = 0.25, P2 = 9.1795 for x > 0. The numerical result is shown 
with circles (with a spacing equal to the mean particle spacing in 
the low-density region) and compared with the analytic result at 
t = 5.0 . A shock of Mach-number M = 1.48 develops. 

the "standard isothermal test case" of Boss & Bodenheimer 
(1979), where we carry out a resolution study that examines 
the reliability of the onset of fragmentation. 

As a test of the accuracy of the dark matter dynam- 
ics, we consider the dark matter halo mass function and the 
two-point correlation function obtained for two 256 3 simu- 
lations of cosmological structure formation. Our initial con- 
ditions are the same as those used recently by Heitmann 
et al. (2004) in a comparison of several cosmological codes. 
We also use their results obtained for these different codes 
to compare with GADGET-2. 

We then consider the formation of the 'Santa Barbara 
cluster' (Frenk et al., 1999), a realistic hydrodynamical sim- 



6.2 Collapse of an adiabatic gas sphere 

A considerably more demanding test problem is the adia- 
batic collapse of an initially cold gas cloud under its own self- 
gravity. Originally proposed by Evrard (1988), this prob- 
lem has been considered by many authors (e.g. Hernquist & 
Katz, 1989; Dave et al., 1997; Wadsley et al., 2004) as a test 
of cosmological codes. The initial conditions in natural units 
(G = 1) take the form of a spherical 7 = 5/3 cloud of unit 
mass and unit radius, with a p oc 1/r density profile, and 
with an initial thermal energy per unit mass of u = 0.05. 
When evolved forward in time, the cloud collapses gravita- 
tionally until a central bounce develops with a strong shock 
moving outward. 

In Figure 10 we show spherically averaged profiles of 
density, radial velocity and entropy of the system at time 
t — 0.8, and compare it to a ID high-precision calculation 
carried out with a PPM scheme by Steinmetz & Mueller 
(1993). An analytic solution is not available for this prob- 
lem. We show results for two different resolutions, 1.56 x 10 6 
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Figure 10. Adiabatic collapse of a gas sphere ('Evrard'-test). At time t = 0.8, we show radial profiles of density, velocity, and entropy 
for two different resolutions, in the top row for 1.56 X 10 6 particles, in the bottom row for 1.95 X 10 5 particles. The solid lines mark 
the result of a ID PPM calculation (Steinmetz & Mueller, 1993), which can be taken as a quasi exact result in this case. The 3D SPH 
calculations reproduce the principal features of this solution generally quite well. But as expected, the shock is broadened, and also shows 
some pre-shock entropy generation. The latter effect is particularly strong in this spherically symmetric problem because of the rapid 
convergence of the flow in the infall region in front of the shock, which triggers the artificial viscosity. However, the post-shock properties 
of the flow are correct. 



and 1.95 x 10 5 particles; lower resolution runs are still able 
to reproduce the overall solution well, although the shock 
becomes increasingly more broadened. We see that for suffi- 
ciently high resolution, the 3D SPH calculation reproduces 
the ID PPM result reasonably well. In the region just out- 
side the shock, we see appreciable pre-shock entropy gener- 
ation. As pointed out by Wadsley et al. (2004), this arises 
due to the artificial viscosity which is here already triggered 
at some level by the strong convergence of the flow in the 
pre-shock region. However, this does not lead to any ap- 
parent problems for the post-shock flow. Note that thanks 
to our entropy formulation, the entropy profile is also well 
reproduced at the outer edge of the flow, unlike the test cal- 
culation by Wadsley et al. (2004) using a traditional SPH 
formulation. 



6.3 Isothermal collapse 

Another demanding test problem that couples the evolu- 
tion under self-gravity and hydrodynamics is the 'standard 
isothermal test case' introduced by Boss & Bodenheimer 
(1979). We consider this fragmentation calculation in the 
variant proposed by Burkert & Bodenheimer (1993), where 



a smaller initial non-axisymmetric perturbation is employed; 
this form of the initial conditions has been used in numer- 
ous test calculations since then. The initial state consists of 
a spherical cloud with sound speed c s — 1.66 x 10 4 cms~ 1 
and an isothermal equation of state, P = (? a p. The cloud 
radius is R = 5 x 10 16 cm, its mass is M = IMq, and 
it is in solid body rotation with an angular velocity of 
Lj = 7.2 x 10~ 13 rads _1 . The underlying constant density 
distribution (po = 3.82 x 10~ 18 gcm~ 3 ) is modulated with a 
m = 2 density perturbation, 

p(^) = po[l + O.lcos(20)], (35) 

where <f> is the azimuthal angle around the rotation axis. We 
implement the initial conditions with a sphere of particles 
carved out of a regular grid, where the 10% density pertur- 
bation is achieved with a mass perturbation in the otherwise 
equal-mass particles. 

This simultaneous collapse and fragmentation problem 
requires high spatial resolution and accuracy both in the 
treatment of self-gravity and in the hydrodynamics. A par- 
ticular difficulty is that only a small fraction of the simulated 
mass eventually becomes sufficiently self-gravitating to form 
fragments. As Bate & Burkert (1997) discuss, numerical re- 
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Figure 11. Resolution study for the 'standard isothermal collapse' simulation. We show the gas density in a slice trough the centre of the 
simulated volume at 1.24 free fall times, roughly when the two perturbations at the ends of the bar-like structure become self-gravitating 
and undergo gravitational collapse. From the top left to the bottom row, the particle number increases from 3.3 X 10 4 to 1.71 X 10 7 by 
factors of 8. 



suits are only trustworthy if the Jeans mass is resolved dur- 
ing the calculation. Also, if the gravitational softening is too 
large, collapse may be inhibited and the forming clumps may 
have too large mass. In fact, Sommer-Larsen et al. (1998) 
show that for a finite choice of softening length an arbitrarily 
large mass of gas in pressure equilibrium can be deposited 
in a nonsingular isothermal density distribution with radius 
of order the softening length. On the other hand, a grav- 
itational softening much smaller than the SPH smoothing 
length can lead to artificial clumping of particles. The best 
strategy for this type of fragmentation calculations there- 
fore appear to be to make the gravitational softening equal 
to the SPH softening length, an approach we use in this test 
calculation. While a varying gravitational softening formally 



changes the potential energy of the system, this energy per- 
turbation can be neglected in the highly dissipative isother- 
mal case we consider here. Note that once fragmentation 
occurs, the density rises rapidly on a free fall timescale, and 
the smallest resolved spatial scale as well as the timestep 
drop rapidly. This quickly causes the simulation to stall, 
unless the dense gas is eliminated somehow, for example by 
modelling star formation with sink particles (Bonnell et al., 
1997). 

In Figure 11, we compare the density fields at t = 1.24 
free fall times in the z — plane, orthogonal to the rota- 
tion axis, for four different numerical resolutions, ranging 
from 3.3 x 10 4 to 1.71 x 10 7 . At this time, an elongated 
bar-like structure has formed with two high-density regions 
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Figure 12. Resolution study for the 'standard isothermal col- 
lapse' simulation. We here compare the temporal evolution of 
the maximum density reached in simulations of different particle 
number, as indicated in the legend. Symbols give the SPH result 
(8 X 10 4 particles) of Bate & Burkert (1997), which agrees quite 
well with our result at comparable resolution. The small residual 
differences are plausibly due to differences in the employed SPH 
density estimator or the neighbour number. 



at its ends. Due to a converging gas-flow onto these ends, 
they become eventually self-gravitating and collapse to form 
two fragments. The onset of this collapse can be studied in 
Figure 12, where we plot the maximum density reached in 
the simulation volume as a function of time. It can be seen 
that the three high-resolution computations converge rea- 
sonably well, with a small residual trend towards slightly 
earlier collapse times with higher resolution, something that 
is probably to be expected. The low-resolution run behaves 
qualitatively very similarly, but shows some small oscilla- 
tions in the maximum density in the early phases of the col- 
lapse. Overall, our results compare favourably with those of 
Bate & Burkert (1997), but we are here able to reach higher 
resolution and are also able to reproduce more cleanly a first 
density maximum at t — 1.1, which is also seen in the mesh 
calculations considered by Bate & Burkert (1997). 



6.4 Dark matter halo mass function and 
clustering 

Cosmological simulations of structure formation are the pri- 
mary target of GADGET-2. Because the dominant mass com- 
ponent is dark matter, the accuracy and performance of 
the collisionless N-body algorithms in periodic cosmological 
boxes is of tantamount importance for this science applica- 
tion. To compare results of GADGET-2 to other codes, we 
make use of a recent extensive study by Heitmann et al. 
(2004), who systematically compared the dark matter re- 
sults obtained with a number of different simulation codes 
and techniques. Among the codes tested was also the old 
public version of GADGET- 1 (Springel et al., 2001). As a 
useful service to the community, Heitmann et al. (2004) have 
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Figure 13. Comparison of the differential halo mass function 
obtained with different simulation codes for a 256 3 ACDM simu- 
lation in a 256/i _1 Mpc box. The top panel compares the results 
from GADGET-2 (red symbols with Poisson error bars) with 
those obtained with the old version GADGET-1 (blue circles). 
The bottom panel shows the relative differences with respect to 
GADGET-2 for a larger pool of 6 codes. The evolved density fields 
for the latter have been taken from Heitmann et al. (2004). The 
dashed lines indicate the size of the expected \a scatter due to 
counting statistics. 



made their initial conditions as well as the evolved results 
of their computations publicly available. We here re-analyse 
the dark matter mass function and the two-point autocorre- 
lation function of their data using independent measurement 
code and compare the results with those obtained by us with 
GADGET-2. 

The simulations considered are two runs with 256 3 
particles in periodic boxes of sidelength 64/i _1 Mpc and 
256/i _1 Mpc, respectively, in a Q, m = 0.314, Qa = 0.686 
universe with h — 0.71. Further details about the initial 
conditions are given in Heitmann et al. (2004). We use a 
comoving gravitational softening length equal to 1/35 of the 
mean particle spacing. 

Non-linear gravitational clustering leads to the forma- 
tion of gravitationally bound structures that over time build 
up ever more massive halos. The abundance of halos as a 
function of mass and time is arguably the most important 
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Figure 14. Comparison of different codes with respect to the 
two-point correlation of the evolved density of a 256 3 ACDM 
simulation in a 64 h Mpc box. We show the relative differences 
with respect to GADGET-2 for the group of 6 codes considered by 
Heitmann et al. (2004). The vertical dotted line marks the gravita- 
tional softening length of e = 10 kpc we used for our GADGET-2 
calculation. We explicitly checked that the latter is fully con- 
verged with respect to the time integration and force accuracy 
settings. 

basic result of structure formation calculations. In Figure 13, 
we show the differential halo mass function, computed with 
the standard friends-of-friends (FOF) algorithm using a link- 
ing length equal to 0.2 the mean particle spacing. The top 
panel compares our new GADGET-2 result for the large box 
at z = with the result obtained by Heitmann et al. (2004) 
with GADGET-1. We obtain very good agreement over the 
full mass range. The bottom panel of Fig. 13 extends the 
comparison to the five additional codes examined by Heit- 
mann et al. (2004): The AMR code FLASH (Fryxell et al., 
2000), the parallel tree code HOT (Warren & Salmon, 1995), 
the adaptive P 3 M code HYDRA (Couchman et al., 1995), 
the parallel PM-code MC 2 (Habib et al., in preparation), 
and the tree-PM solver TPM (Bode & Ostriker, 2003). We 
plot the relative halo abundance in each bin, normalised to 
the GADGET-2 result. While there is good agreement for 
the abundance of massive halos within counting statistics, 
systematic differences between the codes become apparent 
on the low mass side. Particularly the codes based purely 
on mesh-based gravity solvers, MC 2 and FLASH, have prob- 
lems here and show a substantial deficit of small structures. 
It is expected that some small halos are lost due to insuf- 
ficient resolution in fixed-mesh codes, an effect that can be 
alleviated by using a sufficiently fine mesh, as MC 2 demon- 
strates. It is worrying however that current adaptive mesh 
refinement (AMR) codes have particularly severe problems 
in this area as well. A similar conclusion was also reached 
independently by O'Shea et al. (2003) in a comparison of the 
AMR code ENZO (O'Shea et al., 2004) with GADGET. As 
gravity is the driving force of structure formation, the novel 
AMR methods clearly need to keep an eye on this issue and 
to improve their gravity solvers when needed, otherwise part 
of the advantage gained by their more accurate treatment 
of hydrodynamics in cosmological simulations may be lost. 
In Figure 14, we show a similar comparison for the two- 



point correlation function of the dark matter in the small 
64/i _1 Mpc box, again normalised to the GADGET-2 results. 
As discussed in more detail by Heitmann et al. (2004), on 
large-scales all codes agree reassuringly well, perhaps even 
better than one might have expected. On small scales, the 
mesh-based codes tend to show a deficit of clustering, con- 
sistent with the results for the mass function. Interestingly, 
the result obtained by Heitmann et al. (2004) for GADGET- 

1 shows a noticeable excess of clustering on very small scales 
compared to our computation with GADGET-2. This hap- 
pens on rather small scales, comparable to the gravitational 
softening scale. This could simply be the result of a differ- 
ent choice of gravitational softening length, but we also be- 
lieve that the GADGET-2 result is the more accurate here. 
As shown by Power et al. (2003), the time integrator of 
GADGET-1 has the property that insufficient time integra- 
tion settings can lead to an increase of the central density in 
halos due to secular integration errors, while for very poor 
timestepping the halo density is eventually suppressed. The 
numerical steepening of the central density profile caused 
by this effect could then show up as a signature of enhanced 
clustering at very small scales, just like what is seen here in 
the GADGET-1 result. 

6.5 Santa Barbara cluster 

In the 'Santa Barbara cluster comparison project' (Frenk 
et al., 1999), a large number of hydrodynamic cosmological 
simulation codes were applied to the same initial conditions, 
which were set up to give rise to the formation of a rich 
cluster of galaxies in a critical density CDM universe, sim- 
ulated using adiabatic gas physics. In total 12 codes were 
compared in this study, including SPH and Eulerian codes, 
both with fixed and adaptive meshes. Each simulation group 
was allowed to downsample the initial conditions in a way 
they considered reasonable, given also their computational 
resources and code abilities, so that the final comparison 
involved computations of different effective resolutions. 

The overall results of this comparison were encouraging 
in the sense that bulk properties of the cluster agreed to 
within ~ 10% and the gas properties were similar in most 
codes, although with large scatter in the inner parts of the 
cluster. However, there have also been some systematic dif- 
ferences in the results, most notably between mesh-based 
and SPH codes. The former showed higher temperatures and 
entropies in the cluster centre than the SPH codes. Also, the 
enclosed gas fraction within the virial radius was systemati- 
cally higher for mesh codes and closer to the universal bary- 
onic fraction, while the SPH codes only found about 90% 
of the universal fraction in the virial radius. Since then, the 
Santa Barbara cluster has been repeatedly used as a test 
problem for cosmological codes, but the question which is 
the 'correct' entropy profile and gas fraction in adiabatic 
cluster has not been settled conclusively so far. 

We have simulated the Santa Barbara cluster at three 
different numerical resolutions (2 x 256 3 , 2 x 128 3 , and 

2 x 64 3 ) with GADGET-2, in each case using a homogeneous 
sampling for the periodic box. Frenk et al. (1999) supplied 
displacement fields at a nominal resolution of 256 3 , which 
we directly used for our high resolution 2 x 256 3 run. The 
initial conditions for the lower resolution runs were con- 
structed by applying a filter in Fourier space to eliminate 
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Figure 15. Radial profiles of the Santa Barbara Cluster. From top left to bottom right, we show spherically averaged profiles of dark 
matter density, gas density, temperature, dark matter velocity dispersion, enclosed gas fraction, and specific entropy. In each case we 
compare results for three different resolutions. The vertical line marks the virial radius of the cluster. The dashed line in the dark matter 
profile is a NFW profile with the parameters given by Frenk et al. (1999). The same profile is also shown in the gas density plot to guide 
the eye (scaled by the baryon to dark matter density ratio). 
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Figure 16. Time evolution of the interaction of a strong shock wave with a dense cloud of gas. The cloud of radius r = 3.5 has an 
initial relative overdensity of 5, and is embedded at pressure equilibrium in ambient gas of unit density and unit pressure. From the left, 
a shock wave with Mach-number M = 10.0 approaches and strikes the cloud. The gas has 7 = 5/3, giving the shock an incident velocity 
of v = 9.586 and a compression factor of 3.884 with respect to the background gas. Each panel shows the gas density in a region of size 
62.5 X 25.0, with the time indicated in the top right corner. The computation assumed periodic boundaries at the top and bottom. 
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modes above the corresponding Nyquist frequencies, in or- 
der to avoid aliasing of power. 

In Figure 15, we compare our results in terms of spher- 
ically averaged profiles for dark matter density, dark matter 
velocity dispersion, gas density, enclosed gas fraction, tem- 
perature, and specific entropy. We use the original binning 
prescriptions of Frenk et al. (1999), and the same axis ranges 
for easier comparison. Our simulations converge quite well 
in all their properties, apart from the innermost bins (we 
have plotted bins if they contained at least 10 dark mat- 
ter or gas particles). However, we note that we confirm the 
finding of Wadsley et al. (2004) that there is a merger hap- 
pening right at z = 0; in fact, in our high-res 2 x 256 3 run, 
the infalling clump is just passing the centre at z — 0, while 
this happens with a slight time offset in the other two runs. 
We have therefore actually plotted the results at expansion 
factor a = 1.02 in Figure 15, where the cluster has relaxed 
again. The results at z = look very similar, only the tem- 
perature, gas entropy, and dark matter velocity dispersion 
at r < 0.1 show larger differences between the simulations. 
As Wadsley et al. (2004) point out, the effects of this unfor- 
tunate timing of the merger presumably also contribute to 
the scatter found in the results of Frenk et al. (1999). 

Our results agree very well with the mean profiles re- 
ported in the Santa Barbara cluster comparison project. 
Our resolution study also suggests that GADGET-2 produces 
quite stable convergence for a clean set of initial conditions 
of different resolution. The mass resolution has been varied 
by a factor 64 and the spatial resolution per dimension by a 
factor 4 in this series; this is already a significant dynamic 
range for 3D simulations, thereby helping to build up trust 
in the robustness of the results of the code. 

The entropy profile our results at small radii (R ~ 0.1) 
appears to lie somewhat above the SPH results reported in 
Frenk et al. (1999) for other SPH codes. This is in line with 
the findings of Ascasibar et al. (2003), and perhaps a con- 
sequence of the entropy-conserving formulation of SPH that 
we have adopted in GADGET-2. Also, the entropy profile ap- 
pears to become slightly shallower at small radii, which sug- 
gests a small difference from the near power-law behaviour 
seen in other SPH codes (see for example the high-res result 
of Wadsley et al., 2004). However, this effect appears to be 
too small to produce the large isentropic cores seen in the 
mesh simulations of Frenk et al. (1999). Such a core has also 
been found in the new AMR code by Quilis (2004). The sys- 
tematic difference between the different simulation methods 
therefore continues to persist. We suggest that it may be 
caused by entropy production due to mixing; this channel 
is absent in the SPH code by construction while it operates 
efficiently in the mesh codes, perhaps even too efficiently. 

Another interesting point to observe is that our SPH 
simulations clearly predict that the enclosed baryon fraction 
is well below the universal baryon fraction at the virial ra- 
dius of the adiabatic cluster. It seems a solid result that our 
results converge at values of around 90%, in clear difference 
with results near ~ 100% predicted by the majority of mesh 
codes in the study by Frenk et al. (1999). However, we note 
that the new AMR code ART of Kravtsov et al. (2005) also 
gives values below the universal baryon fraction, although 
not quite as low as the SPH codes. We can also observe a 
clear break in the profile at ~ 0.6 Mpc, which could not be 
discerned as easily in the results of Frenk et al. (1999). At 




Figure 17. Local gradients in the gas density field at time t = 
4.5, visualised by a grey-scale image with intensity proportional 
to log(|Vp|/p). Clearly visible are the two pairs of primary and 
secondary vortices, as well as the stem of the backflow. The region 
shown has a size of 31.25 X 12.5. 

this radius, the gas profile begins to notably flatten com- 
pared with the dark matter profile. 

6.6 Interaction of a strong shock with a dense gas 
cloud 

As a final hydrodynamical test problem we consider the in- 
teraction of a strong shock wave with an overdense cloud 
embedded at pressure equilibrium in a background gas. This 
can be viewed as a simple model for the interaction of a su- 
pernova blast wave with a dense cloud in the interstellar 
medium. When the shock strikes the cloud, a complicated 
structure of multiple shocks is formed, and vortices are gen- 
erated in the flow around the cloud which lead to its (par- 
tial) destruction. Aside from its physical relevance for simple 
models of the interstellar medium, this makes it an interest- 
ing hydrodynamical test problem. The situation has first 
been studied numerically in a classic paper by Klein et al. 
(1994). Recently, Poludnenko et al. (2002) have readdressed 
this problem with a high-resolution AMR code; they also ex- 
tended their study to cases of multiple clouds and different 
density ratios and shock strengths. 

As initial conditions, we adopt a planar shock wave 
of Mach number M — 10 which enters gas of unit den- 
sity and unit pressure from the negative ^-direction. In the 
frame of the ambient background gas, the shock approaches 
with velocity v = 9.586, leading to a post-shock density 
of p' = 3.884. We adopt a two-dimensional computational 
domain with periodic boundaries in the y-direction, and for- 
mally infinite extension in the x-direction. The boxsize in the 
y-direction is 25 length units, and the radius of the spherical 
cloud of overdensity 5 is r = 3.5. The set-up of SPH par- 
ticles was realized with a glass like particle distribution us- 
ing equal-mass particles. We have first evolved the incident 
shock-wave independently in order to eliminate transients 
that typically arise if it is set-up as a sharp discontinuity, 
i.e. our incident shock is consistent with the SPH smoothing 
scheme. 

In Figure 16, we show density maps of the system at 
different times of its evolution. When the shock strikes the 
cloud, a complicated structure of forward and reverse shocks 
develops. A detailed description of the various hydrodynam- 
ical features of the flow is given by Poludnenko et al. (2002) . 
Two pairs of primary vortices develop in the flow around the 
cloud and start shredding the cloud. This can be seen par- 
ticularly well in the 'Schlieren' image of Figure 17, where 
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Simulation Boxsize (256 3 ) 256/i~ 1 Mpc 64ft- 1 Mpc 



Timesteps 


26 


48 


5794 


Total wallclock time [sec] 


60600 


173700 


Tree walk 


52.8 


% 


41.0 % 


Tree construction 


4.6 


% 


6.4 % 


Tree walk communication 


0.9 


% 


1.6 % 


Work-load imbalance 


6.7 


% 


14.4 % 


Domain decomposition 


13.0 


% 


15.2 % 


PM force 


4.4 


% 


4.9 % 


Particle and tree drifts 


5.3 


% 


4.9 % 


Kicks and timestepping 


1.4 


% 


1.1 % 


Peano keys and ordering 


8.0 


% 


7.8 % 


Misc (I/O, etc.) 


2.9 


% 


2.6 % 



Table 1. CPU- Time consumption in different parts of the code 
for two typical 256 3 dark matter simulations. The initial condi- 
tions for the two simulations are those of Heitmann ct al. (2004). 
We first give the total number of timesteps and the elapsed wall- 
clock time to evolve the simulation to z = on 8 CPUs of a 
Pcntium-IV cluster. The total consumed time is then broken up 
in time spent in different parts of code, as measured by the timing 
routines built into GADGET-2. 

we show a grey-scale map of the local density gradient. 
Overall, our SPH results look similar to the AMR results 
of Poludnenko et al. (2002), but there are also clearly some 
differences in detail. For example, the small 'droplets' of 
gas chopped of from the cloud still survive in the SPH cal- 
culation for a comparatively long time and are not mixed 
efficiently with the background material, a clear difference 
with the mesh-based calculations. 



7 PERFORMANCE AND SCALABILITY 

The performance of a parallel simulation code is a complex 
function of many factors, including the type of physical prob- 
lem studied, the particle and processor numbers employed, 
the choices made for various numerical parameters of the 
code (e.g. time integration settings, maximum allowed mem- 
ory consumption, etc.), and finally of hardware and compiler 
characteristics. This makes it hard to objectively compare 
the performance of different codes, which should ideally be 
done at comparable integration accuracy for the same physi- 
cal system. Given these difficulties, we restrict ourselves to a 
basic characterisation of the performance and scaling prop- 
erties of GADGET-2 without attempting to compare them 
in detail with other simulation codes. 



7.1 Timing measurements for cosmological 
simulations 

In Table 1, we list the total wallclock time elapsed when run- 
ning the two 256 3 dark matter simulations discussed in Sec- 
tion 6.4, based on the initial conditions of Heitmann et al. 
(2004). The measured times are for all tasks of the code, 
including force computations, tree construction, domain de- 
composition, particle drifts, etc. A detailed break-down of 
the relative contributions is given in the table as well. The 
hardware used was a 8 CPU partition on a small cluster 




communication and imbalance 



■ domain decomposition 




Figure 18. Diagram for the time consumption of a rather small 
galaxy collision simulation evolved with a different number of 
processors between 1 and 8. We show a sample of 64 timesteps 
in each case, each represented by a vertical bar with a width 
proportional to the elapsed wall-clock time during this step. Each 
step is additionally subdivided into different constituent parts, 
drawn in different shades of grey as indicated in the legend. 



of Pentium-IV PCs (2.4 GHz clock speed, 2 CPUs per ma- 
chine), using the public MPICH library for communication 
via gigabit ethernet. 

We can see that the CPU consumption is dominated 
by the short-range tree computation, while the PM force 
is subdominant overall. The raw force speed in the short- 
range tree walk of these TreePM simulations (using a 384 3 
mesh) reaches about 21000 forces/sec/processor. This is a 
high number, significantly in excess of what is reached with 
pure tree algorithms. In fact, the latter tend to be signifi- 
cantly slower for this type of simulation, typically by a factor 
4-10. 

Most of the auxiliary tasks of the simulation code, for 
example particle drifting, I/O, and so on, typically require 
a few percent of the total CPU. Some of these tasks arc 
due to the parallelisation strategy, namely the domain de- 
composition, the wait times due to work-load imbalance, and 
the time needed for communication itself. However, provided 
these contributions stay subdominant, we can still expect a 
significantly faster time to solution as a result of parallelisa- 
tion, besides the possibility to carry out larger simulations 
because of the availability of the combined memory of all 
processors. 

In cosmological hydrodynamical TreePM simulations, 
we find that the CPU time required for the SPH computa- 
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Figure 19. Wall-clock times consumed by the test runs of the 
simulation series. An extrapolation to the size of a 2160 3 simu- 
lation suggests that it should require about 1.1 X 10 6 seconds on 
1024 processors. 

tions is roughly equal to that consumed for the short-range 
gravitational tree-forces. This is for example the case in the 
simulations of the Santa Barbara cluster discussed in Sec- 
tion 6.5. The cost of self-gravity is hence comparable to or 
larger than the cost of the hydrodynamical computations in 
GADGET-2. Even in simulations with dissipation this ratio 
shifts only moderately towards a higher relative cost of the 
hydrodynamics, but of course here the total cost of a sim- 
ulation increases substantially because of the much shorter 
dynamical times that need to be resolved. 

7.2 Scalability 

The problem size is an important characteristic when as- 
sessing the performance of a massively parallel simulation 
code. Due to the tight coupling of gravitational problems, it 
is in general not possible to obtain a nearly linear speed-up 
when a small problem is distributed onto many processors. 
There are several reasons that make this impossible in prac- 
tice: (1) There is always some irreducible serial part of the 
code that does not parallelise; this overhead is fixed and 
hence its relative contribution to the total cost keeps get- 
ting larger when the parallel parts are accelerated by using 
more processors. (2) The more processors are used, the less 
work each of them has to do, making it harder to balance the 
work equally among them, such that more and more time is 
lost to idle waiting of processors. (3) When more processors 
are used, a smaller particle-load per processor results, which 
in turn leads to a larger communication-to-compute ratio in 
tightly coupled problems. 

For all of these reasons, perfect scalability at fixed prob- 
lem size can in general not be expected. In Figure 18, we il- 
lustrate this with a rather small galaxy collision simulation, 
consisting of two galaxies with 30000 collisionless particles 
each, distributed into a stellar disk and an extended dark 
matter halo. We have evolved this simulation with GADGET- 
2 using different processor numbers, from 1 to 8. The dia- 



Name 


Acpu 


Apart 




L box [A^Mpc] 


S4 


4 


100 3 


128 3 


23.1 


S8 


8 


146 3 


192 3 


33.8 


S16 


16 


216 3 


256 3 


50.0 


S32 


32 


318 3 


384 3 


73.6 


S64 


64 


464 3 


576 3 


108.0 



Table 2. Simulations done for the scaling test. All runs used 
the same mass- and length resolution of 1.03 X 10 9 h~ 1 Mq and 
5/i _1 kpc, respectively, and were started at Zi n j t = 49. The runs 
used equal settings for force accuracy and time integration pa- 
rameters, and all were asked to produce the same number of out- 
puts, at which point they also did group finding, power spectrum 
estimation and two-point correlation function computation. 

gram in Figure 18 shows the time consumption in different 
parts of the code, during 64 typical steps taken from the 
simulation. Each step is shown with an area proportional 
to the elapsed wall-clock time, and different shades of grey 
are used for different parts of the code within each step. 
In particular, black is used for the actual tree walk, while 
light grey marks losses of some sort or the other (primarily 
wait times due to work-load imbalance, and communication 
times). We see that the relative fraction of this light grey 
area (at the top) relative to the total keeps growing when 
the number of processors is increased. In fact, the scaling is 
disappointing in this example, falling significantly short of 
perfect scaling where the total area for the 64 steps would 
decline as the inverse of the processor number. However, 
this result is not really surprising for such a small problem; 
when typical timesteps last only fractions of a second and 
the particle-load per processor is very low, the problem size 
is simply too small to allow good scaling with GADGET-2's 
massively parallel algorithms. We also see that the widths 
of the different steps follow a particular pattern, stemming 
from the individual timestep integration scheme, where the 
occupancy of certain steps with 'active' particles constantly 
changes. The two large grey bars represent the computa- 
tion of the gravitational potential for all particles, which 
was here carried out in regular intervals to monitor energy 
conservation of the code. 

However, arguably of more practical relevance for as- 
sessing the scaling of the code is to consider its performance 
when not only the processor number but at the same time 
also the problem size is increased. This is of immediate rel- 
evance for practical application of a simulation code, where 
one typically wants to employ large numbers of processors 
only for challengingly large problems, while small problem 
sizes are dealt with correspondingly fewer processors. A si- 
multaneous variation of problem size and processor number 
can alleviate all three of the scaling obstacles listed above. 
However, changing the problem size really means to change 
the physics of the problem, and this aspect can be eas- 
ily confused with bad scaling when analysed superficially. 
For example, increasing the problem size of a simulation of 
cosmological structure formation either improves the mass 
resolution or the volume covered. In both cases, typically 
more simulation timesteps will be required to integrate the 
dynamics, either because of better spatial resolution, or be- 
cause more massive systems of lower space-density can form. 
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The intrinsic computational cost of a simulation therefore 
typically scales (sometimes considerably) faster than linear 
with the problem size. 

With these caveats in mind, we show in Figure 19 the re- 
quired run times for a scaling experiment with cosmological 
ACDM dark matter simulations, carried out with GADGET- 
2 on a cluster of IBM p690 systems. In our series of five sim- 
ulations, we have increased the particle number from 10 6 to 
10 s , in each step by roughly a factor of \/T0. At the same 
time we also doubled the number of processors in each step. 
We kept the mass and spatial resolutions fixed at values of 
10 9 /i _1 M Q and e = 5 ft _1 kpc, respectively, i.e. the volume of 
the simulations was growing in this series. We also increased 
the size of the FFT mesh in lock step with the particle num- 
ber. In Table 2, we list the most important simulation pa- 
rameters, while in Figure 19, we show the total wall-clock 
times measured for evolving each of the simulations from 
high redshift to z = 0, as a function of particle number. We 
note that the measurements include time spent for comput- 
ing on-the-fly friends-of-friends group catalogues, two-point 
correlation functions, and power spectra for 64 outputs gen- 
erated by the runs. However, this amounts only to a few per 
cent of the total CPU time. 

We see that the simulation series in Figure 19 fol- 
lows a power-law. For a perfect scaling, we would expect 
Twaiidock oc TVpart /JVcpu , which would correspond to a power- 
law with slope n — 1— log(4) ~ 0.4 for the series. Instead, the 
actually measured slope (fitted blue line) is n = 0.52, slightly 
steeper. However, the perfect scaling estimate neglects fac- 
tors of log(Af par t) present in various parts of the simulation 
algorithms (e.g. in the tree construction), and also the fact 
that the larger simulations do need more timesteps than the 
smaller ones. In the series, the number of timesteps in fact 
increases by 23% from S4 to S64. Overall, the scaling of the 
code is therefore actually quite good in this test. In fact, an 
extrapolation of the series to 2160 3 ~ 1.0078 x 10 10 parti- 
cles in a 500 /i~ 1 Mpc box suggests that such a simulation 
should be possible on 1024 processors in about 1.1 x 10 6 sec. 
This simulation has in fact been realized with GADGET-2 
in the first half of 2004, finishing on June 14. This 'Millen- 
nium' simulation by the Virgo consortium (Springel et al., 
2005) is the largest calculation carried out with GADGET-2 
thus far, and it is also the largest high-resolution cosmolog- 
ical structure formation simulation at present, reaching a 
dynamic range of 10 5 everywhere in the periodic simulation 
box. The total wall-clock time required for the simulation on 
the 512 processors actually used was slightly below 350,000 
hours, only about 10% more than expected from the above 
extrapolation over two orders of magnitude. This shows that 
GADGET-2 can scale quite well even to very large processor 
partitions if the problem size is sufficiently large as well. 

Finally, in Figure 20 we show the cumulative CPU time 
consumed for the five simulations of the series as a function 
of cosmological scale factor. We have normalised the total 
CPU time consumption, T cpu = T wa n c i ock x iV C p U , to the 
number of particles simulated, such that a measure for the 
computational cost per particle emerges. To first order the 
required CPU times scales roughly linearly with the scale 
factor, and grows to order of a few dozen milliseconds per 
particle. At the time of the test run, the p690 cluster was not 
yet equipped with its fast interconnection network, which led 
to the comparatively poorer performance of the S64 simu- 
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Figure 20. Total elapsed wall-clock time per particle of each test 
run as a function of cosmological scale factor. The elapsed run- 
times of each simulation have been multiplied by the processor 
number, and normalised by the particle number. 

lation as a result of the communication intensive PM part 
taking its toll. On current high-end hardware (which is al- 
ready faster than the p690 machine), GADGET-2 reaches a 
total CPU cost of about 10 ms per dark matter simulation 
particle in realistic simulations of cosmological structure for- 
mation evolved from high redshift to the present. 

7.3 Memory consumption 

The standard version of GADGET-2 in TreePM mode uses 20 
variables for storing each dark matter particle, i.e. 80 bytes 
per particle if single-precision is used. For each SPH parti- 
cle, an additional 21 variables (84 bytes) are occupied. For 
the tree, the code uses 12 variables per node, and for a sec- 
ondary data structure that holds the centre-of-mass velocity 
and maximum SPH smoothing lengths of nodes, another 4 
variables. For a typical clustered particle distribution on av- 
erage about ~ 0.65 nodes per particle are needed, so that the 
memory requirement amounts to about 42 bytes per parti- 
cle. Finally, for the FFTs in the PM component, GADGET-2 
needs 3 variables per mesh-cell, but the ghost-cells required 
around local patches increase this requirement slightly. Tak- 
ing 4 variables per mesh-cell as a conservative upper limit, 
we therefore need up to 16 bytes (or 32 bytes for double 
precision) per mesh-cell for the PM computation. This can 
increase substantially for two-level PM computations, since 
we here not only have to do zero padding but also store the 
Greens function for the high-resolution region. 

While being already reasonably memory-efficient, the 
standard version of GADGET-2 is not yet heavily optimised 
towards a lean memory footprint. This has been changed 
however in a special lean version of the code, where some 
of the code's flexibility was sacrificed in favour of very low 
memory consumption. This version of the code was used 
for the Millennium simulation described above. The mem- 
ory optimisations were necessary to fit the simulation size 
into the aggregated memory of 1 TB available on the su- 
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percomputer partition used. By removing explicit storage 
for long- and short-range accelerations, particle mass and 
particle type, the memory requirement per particle could be 
dropped to 40 bytes, despite the need to use 34-bit numbers 
for labelling each particle with a unique number. The tree 
storage could also be condensed further to about 40 bytes 
per particle. Since the memory for PM and tree parts of 
the gravitational force computation are not needed concur- 
rently, one can hence run a simulation with a peak memory 
consumption of about 80 bytes per particle, provided the 
Fourier mesh is not chosen too large. In practice, one has 
to add to this some additional space for a communication 
buffer. Also, note that particle-load imbalance as a result 
of attempting to equalise the work-load among processors 
can lead to larger than average memory usage on individual 
processors. 



GADGET-2 to the public* . In a time of exponentially grow- 
ing computer power, it remains an ongoing challenge to de- 
velop numerical codes that fully exploit this technological 
progress for the study of interesting astrophysical questions. 
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8 DISCUSSION 

In this paper, we have detailed the numerical algorithms 
used in the new cosmological simulation code GADGET-2, 
and we presented test problems carried out with it. We have 
emphasised the changes made with respect to the previous 
public version of the code. We hope that the improvements 
made in speed, accuracy and flexibility will help future re- 
search with this code by allowing novel types of simulations 
at higher numerical resolution than accessible previously. 

In terms of accuracy, the most important change of 
the code lies in an improved time-integration scheme, which 
is more accurate for Hamiltonian systems at a comparable 
number of integration steps, and in an 'entropy-conserving' 
formulation of SPH, which especially in simulations with ra- 
diative cooling has clear accuracy benefits. Also, large-scale 
gravitational forces are more accurate when the TreePM 
method is used and offer reduced computational cost com- 
pared to a pure tree code. 

In terms of speed, the new code has improved in essen- 
tially all of its parts thanks to a redesign of core algorithms, 
and a complete rewrite of essentially all parts of the sim- 
ulation code. For example, the domain decomposition and 
tree construction have been accelerated by factors of several 
each. Likewise, the SPH neighbour search has been sped up, 
as well as the basic tree-walk, despite the fact that it now 
has to visit many more nodes than before due to the lower 
order of the multipole expansion. 

In terms of flexibility, the code can now be applied to 
more types of systems, for example to zoom simulations with 
a 2-level TreePM approach, or to gas-dynamical simulations 
in two dimensions. GADGET-2 also uses considerably less 
memory than before which makes it more versatile. The code 
can now be run on an arbitrary number of processors, and 
has more options for convenient I/O. Also, the code has 
become more modular and can be more easily extended, 
as evidenced by the array of advanced physical modelling 
already implemented in it, as discussed in Section 2.3. 

In summary, we think GADGET-2 is a useful tool for 
simulation work that will hopefully stimulate further devel- 
opment of numerical codes. To promote this goal, we release 
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